ABSTRACT This paper presents a system for combining the
two related but still separate modes of electronic
document transfer, namely, electronic mail (E-mail) and
facsimile. Messages prepared at desk-top terminals that
support compound document (i.e., text and graphics)
creation and editing, and submitted to the E-mail network,
can thus be received both by interconnected terminals in the
common E-mail architecture and by facsimile terminals
connected to the public switched telephone networks
(PSTN). The latter transmission is achieved by facsimile
gateways which connect the two different architectures and
perform the necessary protocol conversions.



I. INTRODUCTION

At present, two widely different modes of electronic
transfer of documents prove successful: facsimile and
electronic mail (E-mail). The former is based on real-time
terminal-to-terminal transmissions between scanning devices
attached to the public (circuit) switched telephone networks
(PSTN). The latter employs computer networks, providing
a message handling service with store-and-forward and
multi-addressing facilities. Despite the higher long-distance
transmission costs, facsimile appears to be more popular.
This popularity is due to the close relation with the familiar
paper-oriented correspondence, the ability to handle
graphics, the very large base of well-standardized terminals,
and the ubiquitous telephone networks. Electronic mail, on
the other hand, has gained support mainly because of the
ASCII-oriented corporate communications with features
such as personal mailboxes, desk-top editing and document
filing facilities [1].

Recently, the feasibility of incorporating graphical
features (such as line drawings, maps or charts, signatures,
mathematics or music symbols and non-ASCII text) into
today's text-oriented E-mail systems was demonstrated, using
terminals with writing facilities [2]. The technique, referred
to as telegraphics, permits transfer of compound documents
(i.e., containing both text and graphics) between
interconnected terminals in a common message handling
architecture.



The purpose of this paper is to present a system
based on techniques which combine the advantages of
the two existing modes of compound document transfer (i.e.,
telegraphics [2] and facsimile). The system allows generation
of compound documents on desk-top PC-based telegraphic
terminals, and transmission of these documents both to
other terminals in the common E-mail architecture, and to
standard facsimile terminals connected to PSTN. The latter
transmission is achieved via facsimile gateways which connect
the two different architectures by performing the necessary
protocol conversions. The significance of this system can be
projected from the ever growing use of the two services and
from the desire to correspond beyond the E-mail architecture [3].
In addition, the former transmission requires only a
narrow-band data link, or less transmission time, due to the more
efficient coding of texts and line graphics, while facsimile in
general produces much larger data sizes, leading to a
requirement of wider bandwidth or more transmission time.
Therefore, long-distance document transfer via E-mail and
gateways would reduce communication costs, compared with
the traditional facsimile transmissions.



II. CODED REPRESENTATIONS OF
COMPOUND DOCUMENTS

A. Differential Chain Coding (DCC) of Line Drawings

Line drawings generated by the movements of a
stylus on a writing tablet [4] can be encoded based on
spatial sampling and vector quantization performed by
translating a square coding ring along each curve [2] (Fig.1). A
curve $l$ is thus described by the pen-down point and a
vector chain, that is,

$$ l = P_0, \langle v_1, v_2, \ldots \rangle, \qquad v_i \in V, $$

from which $l$ can be reproduced with the pre-defined
maximum quantization errors [6],[8]. Here, $V$ is a set of $8M$
($M = 1, 2, 3, \ldots$) vectors defined by a square coding ring,
i.e., $V = \{u_0, \ldots, u_{8M-1}\}$, as shown in Fig.1.

Fig.1 Vector chain coding of line drawings using a square coding
ring with 8M sampling points with spacing r.

In encoding vector chains into a binary representation
for transmission and storage, data reduction can be achieved
by exploiting the statistical dependence between successive
vectors. For fluently drawn curves and a properly chosen
sampling step (determined by Mr), this dependence is
demonstrated by a higher frequency of occurrence of small
displacements between successive vectors [7],[8]. Toward this
end, Differential Chain Coding (DCC) encodes a vector as
a relative vector and designates this with a shorter codeword
if the displacement (relative to its predecessor) is +1, 0, or
-1. Here, 0 means no direction change, while +1 and -1
indicate one sampling point to the left and right,
respectively. Otherwise, the vector is encoded as an absolute vector
(the first vector of a vector chain is always encoded
absolutely). For example, the curve shown in Fig.1 is, in this
way, encoded as an absolute/relative vector chain: $u_i$, 0, 0,
-1, 0, +1, 0. For coding rings with M = 1, 2 and 3, each
absolute vector can be encoded as one ASCII byte, and each of
the three relative vectors by 2 bits, so that up to 3 successive
relative vectors (a vector string) can be combined to fit into
one 7-bit ASCII byte [2].
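
The absolute/relative decision described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the 8M ring vectors are numbered 0 to 8M-1 in the direction the text calls "left", so that +1 means the next higher index modulo 8M. The function names and the token representation ("A" for absolute, "R" for relative) are hypothetical, and the bit-level packing of relative vectors into ASCII bytes is not reproduced.

```python
from typing import List, Tuple

# ("A", ring index) for an absolute vector,
# ("R", displacement) for a relative vector with displacement in {-1, 0, +1}.
Token = Tuple[str, int]


def dcc_encode(chain: List[int], M: int) -> List[Token]:
    """Turn a chain of ring indices (0 .. 8M-1) into absolute/relative tokens."""
    n = 8 * M
    tokens: List[Token] = []
    prev = None
    for v in chain:
        if not 0 <= v < n:
            raise ValueError(f"ring index {v} outside 0..{n - 1}")
        if prev is None:
            tokens.append(("A", v))          # first vector is always absolute
        else:
            # Signed displacement on the ring, in the range -4M .. 4M-1.
            d = (v - prev + n // 2) % n - n // 2
            tokens.append(("R", d) if d in (-1, 0, 1) else ("A", v))
        prev = v
    return tokens


def dcc_decode(tokens: List[Token], M: int) -> List[int]:
    """Inverse of dcc_encode: rebuild the chain of ring indices."""
    n = 8 * M
    chain: List[int] = []
    for kind, value in tokens:
        if kind == "A":
            chain.append(value)
        else:
            chain.append((chain[-1] + value) % n)
    return chain
```

With M = 1, the index chain [3, 3, 3, 2, 2, 3, 3] yields one absolute token followed by the relative displacements 0, 0, -1, 0, +1, 0, the same pattern as the example chain quoted in the text.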

Fig.2 shows the general construction of the data 
syntax for DCC encoded line drawings. Data for each curve 
is packed according to this syntax. The Line Control (LC) 
contains information such as coding-ring size M, line 
attributes (color, thickness). LC only appears if it is different 
from that of the preceding curve; a special code H (header) 
is used to indicate its presence. The code D (pen-down) 
indicates the start of data, which is immediately followed by 
the coordinates of the curve starting point. Then follow the 
first absolute vector and the encoded sequence of absolute 
and relative-vector strings. Finally, the code U (pen-up) 
terminates the data packet.
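
The packet layout described above can be sketched as a symbolic token list. This is only an illustration of the ordering of the fields (optional H and Line Control, then D, starting point, vectors, and U); the actual byte values of the codes H, D and U are not given here, and the function name, the dictionary used for the Line Control, and the vector placeholders are all assumptions.

```python
from typing import Any, Dict, List, Optional, Tuple

LineControl = Dict[str, Any]   # e.g. ring size M, color, thickness


def build_packet(start: Tuple[int, int],
                 vectors: List[Any],
                 line_control: LineControl,
                 previous_control: Optional[LineControl] = None) -> List[Any]:
    """Assemble one curve's packet as a list of symbolic fields."""
    packet: List[Any] = []
    # The Line Control appears only when it differs from the preceding curve's;
    # the header code H signals its presence.
    if line_control != previous_control:
        packet += ["H", line_control]
    packet += ["D", start]      # pen-down code, then the starting coordinates
    packet += vectors           # first absolute vector, then vector strings
    packet.append("U")          # pen-up terminates the packet
    return packet
```

Calling `build_packet` for two consecutive curves with the same Line Control, passing the first curve's Line Control as `previous_control` for the second, produces a second packet that starts directly with D.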

B. Compound Document Representation Format

Based on the above coding technique, a compound
document page with text and line graphics is represented by
ASCII bytes, and appears to the message handling network in
the format of pure ASCII strings. Receiving telegraphic
terminals must decode it to restore the text and graphics on
screen or in the form of hard copies.

An E-mail document page thus consists of two blocks
of ASCII codes representing, respectively, text and line
graphics on the page. We chose to identify the text block by
a unique character combination "!h" in front of the block.
The separation of the graphics block from the text block is
achieved by heading each graphic ASCII line with "!g".
Message lengths are therefore variable. In Fig.3, an
example is shown, where (a) is the screen dump of a
message consisting of both text and line graphics, and (b) is
the print-out of the message codes. The total data size of
this compound page is 1,434 bytes.
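
A receiving terminal has to separate the two blocks before decoding. The following minimal sketch assumes that the markers are the two-character strings "!h" and "!g" placed at the very start of a line, and that every line without a "!g" prefix belongs to the text block; the function name is hypothetical.

```python
from typing import List, Tuple


def split_page(page: str) -> Tuple[List[str], List[str]]:
    """Split a page into (text lines, graphics lines) using the line prefixes."""
    text_lines: List[str] = []
    graphics_lines: List[str] = []
    for line in page.splitlines():
        if line.startswith("!g"):
            graphics_lines.append(line[2:])   # DCC-coded graphics data
        elif line.startswith("!h"):
            text_lines.append(line[2:])       # first line of the text block
        else:
            text_lines.append(line)           # continuation of the text block
    return text_lines, graphics_lines
```

The graphics lines returned here would then be handed to the vector chain decoder, while the text lines are rendered directly.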







Fig.2 Construction of DCC encoded vector chain data packet 

Fig.3 An example of compound telegraphic document page.
(a) screen dump, (b) message codes in telegraphic format.

III. Structure of E-mail to Facsimile Conversion System



IV. Description of Facsimile Gateways 



Facsimile gateways are used to connect the two
different network architectures by performing the necessary
protocol conversions. Fig.4 illustrates a model for this
conversion. The essential part is the transcoding which
converts an E-mail message into the CCITT standard
facsimile message format.



[Fig.4 diagram: E-mail architecture, facsimile gateway, facsimile
architecture; layers: Coding and Representation, Data Transfer]

Fig.4 A model of protocol conversions between E-mail and 
facsimile. 



One of the design considerations for the system
structure is that message conversions should take place
close to the destination facsimile terminals. This is because
the data size generally increases 10-20 times when
converting an E-mail message into the CCITT standard
facsimile data format. A physical mapping of a gateway
within the E-mail system is shown in Fig.5. Here, MTS 
(Message Transfer System) and SRA (Submission/ 
Reception Agent) are the entities of the E-mail system. 
MTS provides application-independent data transfer. SRA 
interacts, on one hand, with MTS in order to submit 
messages to or receive messages from it and, on the other 
hand, with user terminals in order to transfer messages via 
the mailboxes in which messages received from MTS and 
messages for submission to MTS are (temporarily) stored. 

In Fig.5, the gateway (GW) and SRA are co-resident in the
same processing system. Any incoming messages intended
for local facsimile terminals are relayed by SRA to the
gateway for processing and delivery. The information relayed
includes a relay-envelope, in addition to the message contents.
Among other items (such as sender's identifications, message
identifications, service requests, etc.), the relay-envelope
specifies the destination facsimile terminals, which are
provided by the sender when submitting the message.




FAX: facsimile    UT: E-mail User Terminal
Fig.5 A facsimile gateway within a standard E-mail network.



A. Service Elements

The gateway software should support at least the 
following basic service elements: message conversions, 
facsimile communications and failed-delivery notifications. 

(a) Message Conversion: An E-mail message intended for 
facsimile terminals is received by SRA and relayed to a 
facsimile gateway for processing. The gateway converts the 
E-mail message into the format suitable for reception by 
facsimile terminals. 

(b) Facsimile Communications: A facsimile message is sent
to one or more destination facsimile terminals specified by
the sender, using the standard communication procedures
recommended by CCITT [9].

(c) Failed-delivery Notification: An attempt to deliver a
facsimile message may fail for various reasons (invalid
facsimile number, terminal engagement, incompatibility, or
defects of equipment). Depending on the type of failure,
redial of the destinations may be invoked automatically after
a defined time interval. The final failure will cause the
delivery to be aborted, and a failed-delivery notification
(FDN) is provided. The FDN message is delivered to the
sender's mailbox by E-mail.
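
The redial and notification behaviour of element (c) can be sketched as a small retry loop. Everything specific in this sketch is an assumption made for illustration: the failure labels, the set of failures considered worth a redial, the number of attempts, the interval, and the shape of the FDN record are not specified in the text.

```python
from typing import Callable, Dict, Optional

# Hypothetical labels for failures that may clear up on a later attempt.
RETRYABLE = {"busy", "no_answer"}


def deliver_with_redial(send: Callable[[str], Optional[str]],
                        number: str,
                        max_attempts: int = 3,
                        interval_s: int = 300,
                        wait: Callable[[int], None] = lambda seconds: None
                        ) -> Optional[Dict[str, str]]:
    """Try to deliver; return None on success or an FDN record on final failure.

    `send(number)` is assumed to return None on success, or a failure label.
    `wait` is injected so that callers decide how the interval is spent.
    """
    failure: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        failure = send(number)
        if failure is None:
            return None                       # delivered
        if failure not in RETRYABLE:
            break                             # e.g. invalid number: do not redial
        if attempt < max_attempts:
            wait(interval_s)                  # defined time interval before redial
    return {"type": "FDN", "destination": number, "reason": str(failure)}
```

The returned record stands in for the FDN message that the gateway would submit to the sender's mailbox through the E-mail system.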



B. Message Conversion Algorithm 

An E-mail message is ASCII-encoded and may 
consist of both text and line graphics. The corresponding 
facsimile message is obtained by pixel-based encoding of the 
overlay of text and graphics. To convert an E-mail message, 
the facsimile gateway maintains a two-dimensional bit-array 
which has numbers of rows and columns corresponding to 
the facsimile resolutions. For standard Group 3 facsimile [9],
this array has 1728 columns and 1076 rows. A bit in the
array is '1', corresponding to a black pixel, or '0', to a white
pixel. The process of conversion includes three steps and
involves a fair amount of computation and bit manipulation.

The first step is to convert each line of text characters 
into a two-dimensional matrix of pixels and then place them 
into the appropriate position in the bit-array. The pixels for 
each character are taken from a look-up table containing the 
character font specified. 

The second step is to overlay line graphics onto the 
bit-array. This involves a decoding process of vector chains. 
Each vector (absolute or relative), is a straight line segment 
and can be mapped onto the bit-array using a line algorithm. 
For each chain, the starting point of one vector is the ending 
point of the previous vector, except the starting point of the 
first vector. This is the pen-down point given in the vector 
chain data packet (Fig.2). 

And finally, at the third step, each row of bits in the
bit-array is encoded using the standard facsimile coding
scheme [9]. The coded data is buffered in a binary file,
ready for transmission to the destination facsimile terminals.
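
The second and third steps can be illustrated with a minimal sketch. The text only speaks of "a line algorithm"; Bresenham's algorithm is used here as one common choice, not necessarily the one used in the gateway. Vectors are assumed to be given directly as pixel displacements (dx, dy), and the third step is shown only up to the run lengths of each row: the code tables of the standard facsimile coding scheme are not reproduced. All function names are hypothetical.

```python
from typing import List, Tuple

BitArray = List[List[int]]   # rows of 0 (white) / 1 (black)


def draw_line(bits: BitArray, x0: int, y0: int, x1: int, y1: int) -> None:
    """Set the pixels of a straight segment (Bresenham, all octants)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        if 0 <= y0 < len(bits) and 0 <= x0 < len(bits[0]):
            bits[y0][x0] = 1                 # clip silently at the page border
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def overlay_chain(bits: BitArray, pen_down: Tuple[int, int],
                  vectors: List[Tuple[int, int]]) -> None:
    """Draw a vector chain: each vector starts where the previous one ended."""
    x, y = pen_down
    for dx, dy in vectors:
        draw_line(bits, x, y, x + dx, y + dy)
        x, y = x + dx, y + dy


def run_lengths(row: List[int]) -> List[int]:
    """Alternating white/black run lengths, starting with a white run."""
    runs: List[int] = []
    colour, count = 0, 0
    for bit in row:
        if bit == colour:
            count += 1
        else:
            runs.append(count)               # may be 0 if the row starts black
            colour, count = bit, 1
    runs.append(count)
    return runs
```

For a full page the bit-array would be created with 1076 rows of 1728 zeros, text glyphs would be copied in from the font table first, and `run_lengths` would be applied to every row before code-word assignment.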

V. Evaluations of E-mail/Facsimile System

In this section, we present an extensive comparison
of the two present modes of compound document transfer:
telegraphic and facsimile. The integration of the two systems
based on the use of facsimile gateways combines most of the
advantages of both modes.

In general, facsimile is based on real-time
terminal-to-terminal transmission of (pre-prepared) paper-oriented
documents via PSTN. Its major advantages are:

- internationally standardized, ensuring world-wide
interworking of terminals from different
manufacturers, via PSTN, the most widely available network
throughout the world;

- a large (and still growing) installed base of facsimile
terminals, which allow flexible exchanges of
documents;

- the ability to exchange compound documents (i.e.,
text and graphics);

- low local transmission costs within most national
PSTNs, due to the relatively modest telephone tariffs
for local calls.

Against this, however, there are at present also a few 
significant disadvantages of facsimile: 

- no desktop editing and document filing facilities; 

- at most copy quality of document transfer; 

- large data size (e.g., by Group 3 facsimile coding, 
generally well above 25 kilobytes per page); 

- the high long-distance transmission costs of 
international telephone traffic apply. 

Most of these shortcomings, however, can be avoided
in the telegraphic system, which also provides additional
advantages:

- integration into multi-functional individualized
computer-based workstations allowing desktop
editing, transmission and receiving, and filing of
documents;

- first-level printout quality of documents at receiving
ends;

- much smaller data file sizes, due to the efficient
ASCII coding of text and line graphics;

- lower long-distance transmission cost compared to
facsimile;

- security is provided by the network architecture, e.g.,
personal mailboxes, passwords;

- ready adaptation to different networks (public or
private, circuit or packet switching) in order to
reduce transmission costs or to achieve
point-to-multipoint connectivity.

The transmission costs are of particular interest to
many system users. For terminal-to-terminal facsimile
document transfer, the costs are proportional to the data
sizes. In Fig.6, a cost comparison is shown for document
transfer by facsimile and telegraphics. This comparison is
based on transmitting documents from the Netherlands to
four destinations, namely, domestic (within the Netherlands),
another European country, the US, and Japan. We assume
an average data size of 25 kilobytes for a facsimile page, and
use a data reduction factor of 15 to derive the telegraphic
data file size. The present structure of domestic telephone
tariffs is seen to favor the local use of facsimile. However,
international transmissions are always cheaper by E-mail,
except for single-page transmission (1 page) to another
European country. The cost reduction possible for
international transmission of documents using E-mail and
local facsimile gateways is considerable.
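
The data sizes behind this comparison follow from simple arithmetic, sketched below. The 25 kilobytes per facsimile page and the reduction factor of 15 are the assumptions stated above; taking one kilobyte as 1000 bytes and the example line rate of 9600 bit/s are additional assumptions made only for illustration. No tariff figures are modelled.

```python
FAX_PAGE_BYTES = 25_000      # assumed average facsimile page (1 kB = 1000 bytes)
REDUCTION_FACTOR = 15        # data reduction factor used in the comparison


def telegraphic_bytes(pages: int) -> float:
    """Estimated telegraphic data size for a number of pages."""
    return pages * FAX_PAGE_BYTES / REDUCTION_FACTOR


def transfer_seconds(n_bytes: float, bits_per_second: float) -> float:
    """Time on the line for n_bytes at a given rate, ignoring protocol overhead."""
    return n_bytes * 8 / bits_per_second


# One page: about 1667 bytes in telegraphic form instead of 25 000 bytes.
one_page = telegraphic_bytes(1)
# At a hypothetical 9600 bit/s: about 20.8 s per facsimile page
# against about 1.4 s for the telegraphic page.
fax_time = transfer_seconds(FAX_PAGE_BYTES, 9600)
mail_time = transfer_seconds(one_page, 9600)
```

The estimate of roughly 1.7 kilobytes per telegraphic page is of the same order as the 1,434 bytes of the example page in Fig.3.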

Fig. 6 Comparison of facsimile and E-mail document
transmission costs by public networks, from The Netherlands to
another terminal in (i) The Netherlands (NL), (ii) the United
Kingdom (Europe), (iii) the United States (US), and (iv) Japan.
(Tariffs as of April 1, 1988).

VI. Conclusions 

In this paper, a system for electronic transfer of
documents is presented that allows compound document
interchange between E-mail and facsimile through facsimile
gateways embedded in the E-mail architecture. The
significance of this integration can be projected from the
growing use of the two related, but separate, services. An
extensive comparison of the two modes of document transfer
is performed. Major cost reductions in long-distance
document transfer are possible with respect to the traditional
facsimile communications via the PSTN. This is brought
about by the efficient DCC graphics encoding and the
present tariff structure of international telecommunications.

Abstract:

Securing electronic mail presents
some unique challenges in that it is a
staged delivery application, where the
message originator and message recipient
are not usually in real-time communication
with each other. For applications such as
electronic mail, network level encryption
devices leave vulnerabilities at
intermediate relay points such as mail
servers or message transfer agents (MTAs).

Mail messages are typically
decrypted, and stored RED
(unprotected/unencrypted) at these
repositories. Subsequently, they are
re-encrypted and delivered to their
destinations (see fig 1).

Security at the network layer cannot
provide application specific services 
associated with electronic mail such as 
signed receipts and digital signatures. 
These deficiencies can be remedied by 
providing end user to end user security 
where mail messages are black (protected/ 
encrypted) at all intermediate points. 
This paper describes the approach and 
implementation taken in building such a 
mail security device. It shall address the 
key security services provided and possible 
implementations. 



Approach:

The staged delivery (non-real time)
nature of E-Mail levies the requirement
that each security device must [illegible]
security relevant processes/decisions (i.e.,
confidentiality, integrity, source
authentication, access control) without a
real-time exchange of information between
cooperating parties. The large number of
electronic mail users and their diverse
connectivity requirements make the use of
secret keys distributed from a trusted
source impractical and undesirable. The
mail protocol will use a staged key
information exchange in which a user wishing to
receive secure mail posts his certificate
and some public key information on a mail
or directory server. The sender uses this
posted information along with his own
private information to process a pair-wise
key encryption key, which is unique to the
sender-receiver pair.

The protocol accommodates
multi-addressing by using a single message
key to encrypt the message and then using a
pair-wise key, for each addressee, to
encrypt the message key. The protocol
header includes the sender's [illegible].
It also includes, for each addressee, the
public key information needed to process
their pair-wise key, the encrypted key, and
the integrity check.
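
The multi-addressing structure described above can be sketched as follows. This is a toy illustration of the key hierarchy only and must not be used for real protection. It assumes a Diffie-Hellman style derivation of the pair-wise key from posted public values (the text does not name the actual public key technique), a small toy modulus, a hash-based keystream in place of a real cipher, and an HMAC as the per-addressee integrity check. All names and the header layout are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Tuple

P = 2**127 - 1   # toy prime modulus, far too small for real use
G = 3            # toy generator


def keypair() -> Tuple[int, int]:
    """Return (private value, public value posted on the directory server)."""
    priv = secrets.randbelow(P - 3) + 2
    return priv, pow(G, priv, P)


def pairwise_kek(my_priv: int, their_pub: int) -> bytes:
    """Key encryption key unique to one sender-receiver pair (assumed DH-style)."""
    shared = pow(their_pub, my_priv, P)
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR with SHA-256(key || block offset). Not secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out += bytes(a ^ b for a, b in zip(data[offset:offset + 32], pad))
    return bytes(out)


def seal(sender_priv: int, sender_pub: int,
         recipients: Dict[str, int], message: bytes) -> Tuple[dict, bytes]:
    """Encrypt once with a message key, then wrap that key per addressee."""
    message_key = secrets.token_bytes(32)
    header = {"sender_pub": sender_pub, "addressees": {}}
    for name, pub in recipients.items():
        kek = pairwise_kek(sender_priv, pub)
        header["addressees"][name] = {
            "wrapped_key": keystream_xor(kek, message_key),
            "check": hmac.new(kek, message, hashlib.sha256).digest(),
        }
    return header, keystream_xor(message_key, message)


def unseal(name: str, my_priv: int, header: dict, body: bytes) -> bytes:
    """Recover the message key with the pair-wise key, decrypt, and verify."""
    kek = pairwise_kek(my_priv, header["sender_pub"])
    entry = header["addressees"][name]
    message = keystream_xor(keystream_xor(kek, entry["wrapped_key"]), body)
    expected = hmac.new(kek, message, hashlib.sha256).digest()
    if not hmac.compare_digest(entry["check"], expected):
        raise ValueError("integrity check failed")
    return message
```

The point of the sketch is the shape of the header: the message body is encrypted once, and only the short message key is repeated, encrypted separately for each addressee, so the message size grows little with the number of recipients.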

The electronic mail security device
performs the following security functions:

ENCRYPT MESSAGE 
OUTPUT CREDENTIALS 
HASH MESSAGE 
GENERATE SIGNATURE 
READ CREDENTIALS 
GENERATE KEYS 
ENCRYPT KEYS 
DECRYPT KEYS 

DECRYPT MESSAGE

READ SIGNATURE 

Credentials refer to a user's certificate
and public key information, including the
keying information needed for the
generation of digital signatures.

System Operations: 

The transmission of a secure
electronic mail message involves a sequence
of steps as shown in Table I. Each user
has unique keying information which is
bound to a non-forgeable certificate. The
certificate identifies the end user and
their security privileges. Two users
wishing to communicate exchange
certificates and keying information. The
exchange results in a key that is pair-wise
unique between the two end users.
Establishing a pair-wise key requires the
principals to gather each other's public
information. This information can be
obtained from a mail/directory server much
in the same manner that phone numbers are
taken from telephone books. The
certificate is digitally signed by a
certifying authority that all members
trust. This eliminates the need for trust
at the servers since certificates can
neither be forged nor altered. Certificates
may also be cached by end users to
eliminate needless requests to the server.

The mail message format is shown in
Figure 2.

Privacy of the message is achieved
because only the correct recipient
possesses the pair-wise key and can decrypt
the message (steps 4,5,6,7,11,12,13). The
source of the message is authenticated by
the binding between the sender/receiver
credentials and the pair-wise key (steps
1,2,3,10). The source of the message can


TABLE I - SECURE MAIL TRANSFER

SENDER                                     RECEIVER

                                           1. Transmit Recipient's Credentials
2. Receive Credentials of Recipients
3. Transmit Credentials of Sender
4. Process Key (Key Encryption Key)
5. Generate Key (Traffic Encryption Key)
6. Encrypt Key (Encrypt TEK with KEK)
7. Encrypt Message with TEK
8. Hash Message
9. Generate Signature
                                           10. Receive Sender's Credentials
                                           11. Process Key (KEK)
                                           12. Decrypt Key (TEK)
                                           13. Decrypt Message with TEK
                                           14. Read Signature
                                           15. Hash Message
                                           16. Generate Signature
17. Read Signature

Fig. 2 Voice Message Format



also be authenticated by the sender
cryptographically signing the mail message.
This can be useful if authentication
without privacy is desired (steps
8,9,14,15). Authentication of end users'
identity is achieved through the exchange
of non-forgeable certificates (steps
1,2,3,10). Integrity of the message is
maintained through use of a digital
signature appended to the message.
Insertion, deletion or modification to the
mail message results in an invalid
signature at the receiver (steps
8,9,14,15). The mail transfer scenario
provides the capability of sending
registered mail. The sender cannot deny
having sent mail since the sender is the
only party to have his unique signature key
(proof of origin) (steps 8,9). The
receiver provides proof of delivery by
re-signing the hashword on the message with
his unique signature key and sending the
re-signed hashword back to the sender (steps
15,16).

If link, network or transport layer
security services are employed, they will
act independently of the application layer
security services discussed in this paper.
The new architecture that arises from using
the secure E-Mail device is shown in Figure
3. A network layer security device may be
used to protect mail header information or
to secure other types of data traffic not
secured by the mail security device.

Summary: This paper described an approach
to protecting mail messages along the
entire path from originator to recipient.
The messages are protected during storage
and processing at all intermediate points.
The security can be implemented independently
of the network layer architecture or a
network layer security device. The key
distribution technique is suitable for
client-server architectures such as E-mail
where the number of users is large and
connectivity is diverse.

Abstract

The design and implementation of a conceptual model, 
CAFE (a Categorization Assistant For E-mail), is 
described. The model supports the organization, searching, 
and retrieval of information in e-mail. Three modes are 
available for satisfying the users' needs in various 
situations: the Busy mode for intermittent use at times of 
high stress, the Cool mode for continuous use at the 
computer, and the Curious mode for sporadic use when 
exploring and (re-) organizing messages when more time is 
at hand. The design of the model is motivated partly by the 
results of a case study of categorization on the computer 
screen, and partly by a survey of e-mail clients. The case 
study was inspired by cognitive science theories. The model 
is related to information seeking theories in electronic 
environments. In the implementation each mode required
using a different technique. The Busy mode uses the
text-based Naive Bayesian algorithm, the Cool mode uses
e-mail filtering rules, and the Curious mode uses a combination
of clustering techniques known as Scatter/Gather.

1. Introduction 

Electronic mail (e-mail) is the preferred communication 
medium for an increasingly growing number of users 
around the world. It is one of the "killer applications" of the 
network world today. Moreover, e-mail affects social 
factors and patterns of communication within an 
organization [28]. 

E-mail is used both at home and at work, and important
e-mail messages are increasingly often being mixed with
less important messages in the ever-growing flow of
information between users. Increasingly often users find it
difficult to search and retrieve information in e-mail 
messages. Furthermore, people tend to collect and store 
information for later use, for personal business and, 
typically, for supporting decision-making [22]. In e-mail 
and computer conferencing systems, such as KOM [20] and 
netnews (Usenet News), the storing of information is easy, 
while the retrieval of it often is more difficult. Moreover, it 
is easy to quickly disseminate information to many 
recipients at the same time. This asymmetry is characteristic
of the electronic messaging systems being used today. As
Marchionini [16] (p. 1) states, the general consequences of the
information society we live in are threefold: we have larger 
volumes of information, new forms and aggregations of 
information, and new tools for working with information. 
Furthermore, we also have more complicated information 
needs [9]. 

The e-mail user is rapidly finding herself in dire need of 
some kind of help in structuring and getting a better 
overview of the information contained in her e-mail 
messages. Furthermore, she is in need of retrieving the 
information in better ways. The amount of effort required to 
retrieve relevant information is related to the amount of 
information stored. Among the major reasons for the 
information retrieval difficulty are the lack of explicit 
semantic clustering of (or linkages between) relevant 
information and the limits of conventional search techniques 
using keywords (either full text or index-based) [39].
Especially, the organization of incoming messages 
becomes more critical as the amount of e-mail messages in 
the system grows. A system with support for classifying the 
information would help the recipient in her task of reading 
and selecting relevant messages and avoiding "junk mail" 
or other messages of low interest. Moreover, the support for 
the management of the information contained in e-mail 
messages has to consider both the static storage of 
messages and the dynamic flow of incoming messages. 
Finally, to make it possible for the user to satisfy her
information needs, the system must allow the user to search
for messages by entering queries, and to examine the retrieved
messages, interactively and with a response time of only a
couple of seconds.

In this paper we describe a conceptual model for the 
information management in e-mail. We look for inspiration 
in two places: cognitive science theories for categorization, 
and available techniques for retrieving and displaying
e-mail messages, and organizing them on the computer
screen. We concentrate our efforts mainly on textual 
information because e-mail is (still) a mostly textual 
medium. In section 2, we start by first looking at 
categorization, which is the basic principle behind 
information management. In section 2.2, the conclusions 
from a case study of people's categorization of e-mail 
messages on the computer screen are presented. The 
findings from a survey of the filtering, organization, and
visualization capabilities of some currently available
clients are summarized in section 2.3. Based on
conclusions from the case study and the survey, we then
construct our conceptual model CAFE in section 3 and 
present a prototype implementing the model in section 4. 
We conclude our work and give some directions for future 
work in sections 5 and 6, respectively. 

2. Background 

2.1 Cognitive science theories for information 
categorization 

Categorization of information is studied in both cognitive 
science and information retrieval and filtering (IRIF). 
According to psychologists there are two general and basic 
principles for creating categories: cognitive economy and 
perceived world structure [25]. These principles state that 
the function of categories is to provide maximum
information with the least cognitive effort, and the 
attributes, or features, that an individual will perceive in the 
world, and thus use for categorization of stimuli, are 
determined by the needs of the individuals. Moreover, 
these needs change over time and with the physical and 
social environment. The maximum information with least 
cognitive effort is achieved if categories map the perceived 
world structure as closely as possible [25]. Since the 
perceived world is different for each individual, the 
categories are indeed personal to the individuals using 
them. Similarity plays a central role in placing different 
items into a single category. The similarity of the items in 
a category varies, but to a certain degree — people want to 
minimize within-category variability of similarities 
between items while maximizing between-category 
variability [27]. However, similarity is really "in the eye of the 
beholder" and does not alone explain categorization, since no 
constraints are provided on what is to count as a feature or 
attribute [32]. 

Categories and personal knowledge structures are of 
central interest to cognitive psychology researchers. The 
cognitive psychologists' models of categorization and the 
human memory can provide useful clues for making the 
retrieval of information easier and more intuitive [33] 
(p. 178). Throughout history, different theories for how
categories are structured and created by humans have 
evolved [7]. Three examples are the classical view, the 
probabilistic view, and the theory-dependent view. 

The first two define categories solely based on the 
features or attributes that the items put in the categories 
have [25][32]. Of these, the classical view, first presented 
by Plato, describes categories as structured around features 
that define all of the items in each category. The 
probabilistic view, on the other hand, describes categories
as either organized around a prototype or best example, or
represented by all the individual instances that constitute it. 
The first variant of the probabilistic view is called the 
prototype view and the second variant is called the 
exemplar view. 

In the theory-dependent view categories are based on 
knowledge and world theories (theories that humans use in 
categorization tasks). In other words, people's individual 
theories determine the features that will be important for a 
category. 

The research on categorization in cognitive science has 
progressed from the classical to the probabilistic view and 
from the idea that concepts are organized around similarity 
to the idea that concepts are organized around theories 
(Medin 1989, in [32]). 

Two examples of using the above-mentioned theories for
categorization in the IRIF area are neural networks (for
example, [17]) and fuzzy sets [24]; the latter is an
attempt to use Rosch's prototype view [25] for
modelling categories. Inspired by the cognitive theories,
we designed a case study to learn more about the physical 
and mental processes of people when they sort messages on 
the computer screen. 

2.2 A case study of categorization on the computer 
screen 

The purpose of the case study was to examine how people 
create structures on the computer screen and how the 
structures evolve when increasingly more messages are 
sorted into them. A special structure editor was developed 
for the case study (fig. 1; [12]). Twelve users of e-mail 
acted as subjects. Each subject was asked to sort a number 
of previously unseen e-mail messages into categories of 
their own devising. Five types of queries were used to test 
the efficiency of category structures for retrieval, ranging 
from simple keyword-based queries ("What messages 
contain the URL http://www...?") to situation-based 
queries ("What messages contain relevant information if 
you are a music teacher and you want to start exploring 
music resources on the web?"). Among other things, the 
number of relevant hits was counted and the retrieval time 
was measured. Also, we wanted to see how different 
representations of categories influence the development of 
structures and the retrieval of messages. Three 
different representations of the messages and the categories 
on the computer screen were used: the desktop metaphor 
(cf. the Macintosh environment), the tree structure (cf. the 
file manager in Windows 95), and the mind map [2]. The 
mind map (fig. 1) is a two-dimensional, hierarchical 
structure that provides the subjects with different layout 
functions, such as line thickness and red lines for links 
between categories [12], for organizing messages and 
categories. 

Figure 1. An example of a mind map as displayed 
in the structure editor used in the case study. 

Furthermore, we wished to examine what 
happens with the structures when messages with contents 
that differ from the main contents of the other messages 
("junk mail") are presented to people and how these 
messages are sorted into the structures. For this purpose, two 
different mailing lists were used: one with messages relevant 
to the subjects' background (about choir singing [41]) and 
another one with supposedly irrelevant messages. Finally, we 
hoped to learn more about what different features of the 
messages people use in the categorization procedure — the 
categories of each subject were regularly measured for 
their within-category and between-category variabilities 
and the criteria used for grouping messages were 
determined. 

The case study was a continuation and expansion of a 
previous preliminary study of people's categorization of 
text (e-mail messages and proverbs) on pieces of paper on 
a table [3] [23] [30]. For details about the set-up for the case 
study, see [31]. 

The categories created by the subjects were not 
perfect — many subjects stated that they were not satisfied 
with the structures they had created. However, the results 
do give some hints about usable information management 
and categorization principles. 

The desktop was the most familiar structure to the 
subjects. However, it was very cumbersome to use and 
offered a poor overview of the collection of messages. The 
subjects in the desktop group clearly wanted some other 
means of navigating and grouping messages in the 
structure. 

The hierarchical tree structure was the most efficient 
one for retrieving messages. The subjects were able to 
easily browse the categories in an orderly fashion. This was 
awkward and time-consuming to do in the desktop 
structure, and even more so in the mind map structure. The 
tree structure was very familiar to most of the subjects, 
although the structure became cluttered when increasingly 
more categories were created. 



The mind map was the least familiar representation of 
the three used. The two-dimensional format of the mind 
map seems to have had, at the same time, a stimulating and 
a constraining effect on the sorting procedure. The main 
advantage with the mind map was that the whole 
organization of message categories was visible and 
available at the same time. Furthermore, it could be highly 
personalized in a spatial and graphical way, where related 
items and categories were clustered via spatial proximity. 

The mean number of categories was smallest in the mind 
map group, but at the same time the range was the 
largest. Furthermore, the subjects in the mind map group 
seemed to form more associations with matching messages 
than the subjects of the other groups did to locate messages. 
The Subject line of the messages was extensively used for 
naming the categories, which is a result similar to an 
investigation made by the IntFilter Project at Stockholm 
University [11] (p. 26). The subjects heavily 
reorganized their structures to categorize the 
"junk mail" that was presented to them. The type of 
messages influenced the number of categories more than 
the number of messages did. Finally, there seems to be a need 
for a flexible way of changing the view of categories 
(folders), depending on the task (searching, sorting, etc.) 
that is to be performed. For a more detailed description of the 
results and analysis of the case study, see [31]. 

2.3 State-of-the-art of information management in 
e-mail clients 

Studies have shown a wide level of diversity in the way 
people use their e-mail clients and also a wide range of 
tasks for which they are used [14] [29]. One problem with 
e-mail systems is that the e-mail client often is only a thin 
layer on top of the delivery system [8]. In a survey of e-mail 
clients available for Internet-style e-mail (e-mail using 
SMTP and POP3/IMAP protocols) we investigated what 
different functions were available for the organization of 
messages and visualization of collections of messages [31]. 
The survey revealed a great uniformity of available 
functions. Filtering functions for handling incoming 
messages are common, as are the use of folders for storing 
messages and two-paned or three-paned displays (fig. 3 in 
section 4) for presenting messages on the screen. 

The most basic information management offered to the 
user in the e-mail client consists of the following functions: 
incoming messages are put (automatically by the delivery 
system) in an inbox and, typically, outgoing messages into 
an outbox; the user can read, print, compose, and send 
messages, and she can create folders (mailboxes) and 
manually file messages into the folders for permanent 
storage. 

The folders can be created according to an organization 
principle of the user's own devising and often in a 
hierarchy. Usually, messages can be sorted into folders by 
way of a drag-and-drop interface that lets the user move 
messages among folders with greater ease. Other 
functions or features commonly available in the e-mail 
client are the following: 

• there is a folder list, a message summary, and a 
one-message preview window 

• the filtering system looks at the text in the From and 
Subject lines of a message and, depending on the filter 
rules, moves messages into folders 

• the messages can be searched for words 

• the messages can be addressed through the use of 
aliases, and addresses can be stored in an address book. 

Most up-to-date clients offer a whole system of filters or 
rules that the user can use for automatically performing 
actions on (route, print, and otherwise process) incoming 
messages. BeyondMail [35] is one example and Exmh [36] 
another, each representing a different approach. 
BeyondMail is a commercial product. It is part of an 
integrated environment called groupware, which also 
includes bulletin boards, group schedules, and document 
flow, but is also available as a standalone application with a 
lot of usable functions for organization of e-mail. Exmh, on 
the other hand, is a freely available and highly 
customizable program, with a multitude of user-definable 
functions for filtering, organization, and getting an 
overview of e-mail. Some clients even provide 
programming tools — powerful scripting languages — that 
can be used to build applications or trigger elaborate 
processes based on incoming e-mail [35] [42]. Many times, 
however, these tools are hard to use, even at a basic level, 
e.g., Ishmail's patterns for rules [40]. Finally, the search 
functions vary from simple searching of words in message 
headers in one folder to advanced Boolean searching in all 
folders at the same time — cf. Exmh [36]. 

Most commonly, the vendors of the commercially 
available e-mail clients in our survey make the assumption 
that both sender and recipient of e-mail use the same 
product, i.e., the vendor's product. This makes it of course 
easier to incorporate handling of, e.g., priorities of 
messages (Urgent, Regular, etc.) and forms for special 
types of messages (meeting, phone message, etc.); cf. 
BeyondMail [35]. These vendor-specific features can be 
valuable when creating a personalized structure of 
message categories. They can make the structure more 
meaningful and flexible to the individual user. 
Furthermore, sorting the received messages into categories 
according to priority coding or type of message helps 
make the messages more retrievable and viewable in new 
ways. However, few e-mail clients fully support this 
functionality without relying on vendor-specific features. 



3. CAFE: The conceptual model 

How much of the work of classification of a message can 
be put on the sender and the recipient of the message 
respectively? We argue that the asymmetry in e-mail (see 
section 1) is both necessary and unavoidable. The sender 
does not want to manually classify a message, since it 
would mean more work. One solution could be to introduce 
a common collection of categories for e-mail users and 
their messages. Using this classification system, software 
could be used to automatically classify messages before 
sending them. However, this would mean that each and 
every e-mail user should have the same kind of software for 
classifying and recognizing messages. Furthermore, the 
classification system would most certainly be difficult to 
maintain. Managing the software would be practically 
impossible, considering the wide variety of e-mail clients 
available [31]. Also, the classification system can be 
misused, e.g., classifying messages as being of high 
priority when they are not [28] (p. 75). The burden of 
categorization of messages should be put on the recipient's 
side instead. Hence, our aim is to aid the recipient in the 
classification, organization, and getting an overview of her 
set of messages. 

Furthermore, putting the solution in one "monolithic" 
package, i.e., using one technique to take care of all cases 
of message handling, is not what we want to do. We want 
to make it possible for the recipient of messages to use 
different methods when looking at the information in her 
e-mail. The current state of mind of the recipient is important. 
For example, does she have little or much time to spend on 
reading messages and what is the information she needs at 
the moment? Therefore, we want to make it possible for the 
user to explicitly tell the e-mail client what her current state is. 

According to the principle of perceived world structure 
(see section 2), a computerized system for text 
categorization should be flexible in its management of the 
text and its representation of the user. By this we mean that 
text should be possible to classify in different ways, 
according to the needs of the user. This flexibility requires 
domain knowledge that changes over time. The knowledge 
about texts and users is usually modelled as a combination 
of the document representation and the (explicitly or 
implicitly defined) profiles of the user in the system. An 
example of a categorization system with these features is 
given by [13]. 

Our conceptual model for a Categorization Assistant 
For E-mail (CAFE) makes use of three different modes for 
specifying the user's state. Through the different modes, 
CAFE is designed to support different strategies for 
reading, sorting, and searching messages. Both analytical 
and browsing strategies are supported. Generally speaking, 
these strategies are central for overcoming the information 
problem [16] (pp. 7-8) and alleviating the user's 
"anomalous state of knowledge" (ASK) [1]. The conceptual 
model is shown in fig. 2. The modes are: 



Figure 2. The conceptual model CAFE (diagram labels: 
Message stream, Message storage, Modes). 

• The Busy mode is designed to be used intermittently, for 
locating important messages among the latest messages 
in the message storage. The user is typically in a 
situation when she has little time for reading new and 
unseen messages. The user is presented with a 
prioritized list of messages, grouped into the categories 
(folders) Important, 2nd Class, and Junk [6]. 

• The Cool mode is the default mode designed to be used 
continuously. It operates on the incoming message 
stream. The Cool mode is used in situations when the 
user can read messages little by little during her session 
at the computer. The user's own categories are used for 
storing the messages. 

• The Curious mode is designed to be used sporadically 
(typically once a day), in situations when the user has 
time to spare. The mode is employed when the user 
wants to locate, organize, or reorganize previously 
stored messages. It supports the analysis of a larger 
collection of messages, typically messages from a mailing 
list, in all or a subset of the folders in the message 
storage. The user is presented with groupings of messages 
where she interactively can select categories to "zoom 
in on" and investigate further. 

The main guiding principle in the design of the conceptual 
model of CAFE has been to let it contain alternative 
representations. The user is allowed to select from the three 
representations (modes) according to her current personal 
style, experience, and information problem. This approach 
with using alternative representations is argued for by [16] 
(p. 140). The main argument is that cognitive science offers 
a variety of theories about how humans categorize and 
represent information and knowledge (see section 2.1). The 
need for flexibility in the representation of categories was 
also implied by the results of our case study of 
categorization in e-mail (see section 2.2). Moreover, the 
use and usage of e-mail in general [14] have been of great 
concern in the design of CAFE. 

A general design for a strategy to use in any system for 
accessing information is to use general queries and probes 
to identify a neighbourhood of interest, and then browse 
and filter [16] (p. 181). This is especially supported in the 
Curious mode in CAFE. The Curious mode and the other 
modes can be characterized by their different ways of 
viewing the information in e-mail. Messages already read 
and stored represent a collection that is static in its nature. 
New and unseen messages lying in the inbox or in folders 
form a semi-dynamic collection of messages, i.e., their 
state is likely to change in the near future. The incoming 
messages, finally, form a dynamic collection (a stream) of 
messages waiting to be classified and acted upon by the 
user or the system. In other words, we get the following 
characteristics of the different modes: 

• in the Busy mode, we have a semi-dynamic or static 
deposit of messages (new and unread) on which 
dynamic, automatically created queries are applied 

• in the Cool mode, we have a dynamic stream of 
messages and a set of static, user-defined queries that are 
applied to it 

• in the Curious mode, we have a static message storage 
on which dynamic, interactively created queries are 
applied. 
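
The three characterizations above can be written down as a small 
data structure. The following is only an illustrative sketch: the 
class and field names (Collection, Queries, Mode, MODES) are 
assumptions made for this example and are not part of CAFE or Exmh. 

```python
from dataclasses import dataclass
from enum import Enum


class Collection(Enum):
    """Kinds of message collections distinguished in the text."""
    STATIC = "static"              # messages already read and stored
    SEMI_DYNAMIC = "semi-dynamic"  # new and unseen messages in inbox or folders
    DYNAMIC = "dynamic"            # the incoming message stream


class Queries(Enum):
    """Kinds of queries applied to a collection."""
    AUTOMATIC = "dynamic, automatically created"
    USER_DEFINED = "static, user-defined"
    INTERACTIVE = "dynamic, interactively created"


@dataclass(frozen=True)
class Mode:
    name: str
    collections: tuple  # one or more Collection values
    queries: Queries


# Hypothetical lookup table mirroring the three bullets above.
MODES = {
    "Busy": Mode("Busy", (Collection.SEMI_DYNAMIC, Collection.STATIC),
                 Queries.AUTOMATIC),
    "Cool": Mode("Cool", (Collection.DYNAMIC,), Queries.USER_DEFINED),
    "Curious": Mode("Curious", (Collection.STATIC,), Queries.INTERACTIVE),
}
```

A client could consult such a table when the user switches modes, 
for instance to decide which collection a query should run against. 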

Our aim has been to use simple techniques and metrics, 
whose function and behaviour can be easily understood by 
the user — at least intuitively. A prototype of the conceptual 
model is presented in the next section. 

4. The prototype of CAFE 

The implementation of CAFE is based on the e-mail client 
called Exmh [36]. Exmh was originally conceived with the 
assumption that the user would want to customize it — four 
ways of customization are available, depending on the 
desired extent [21]. Moreover, users are allowed to alter 
and make additions to the source code of Exmh, something 
which is a major bonus when developing an e-mail client. 
Exmh has been used as a basis for the development of 
different extensions by many users [5] [38]. Finally, our 
implementation makes use of known algorithms and 
techniques in IRIF. 

Many people depend on getting e-mail reliably. 
Furthermore, most people (if not all) do not want the 
system to automatically delete e-mail without letting the 
user read it first [29]. Also, you can lose some or all of your 
incoming e-mail if your automatic e-mail handling is not 
working correctly or is not giving you the right feedback. All 
of these issues have been among the central considerations 
in the implementation. Another consideration has been to 
not impose a specific mail handling procedure or ordering 
of actions. However, for practical reasons this cannot be 
avoided. For example, in the Busy mode the user will most 
certainly want to refile some messages for later action, so 
we added the ToDo folder. Each mode uses a different 
information retrieval (IR) or text categorization technique. 
In this regard, the modes are described in more detail 
below. 

The Busy mode. The Busy mode is illustrated by fig. 3. 
Figure 3. The main window of Exmh, with the Busy 
mode of CAFE active. 

The user is currently browsing the folder containing 
important messages. The contents of the menu under the 
Mode button are also shown in the figure. The folder 
display in the top pane of the window contains the folders 
used in the Busy mode: 

• the three main folders Important, 2nd Class, and Junk, 
representing important messages, second class 
messages, and junk messages, respectively 

• the standard folders inbox, outbox, draft, and ToDo, 
representing incoming messages, outgoing messages, 
half-completed messages, and messages to be acted 
upon, respectively. 

The routing of messages into the three main folders is done 
using the text-based Naive Bayesian learning algorithm. 
The algorithm uses Bayes' Theorem from probability 
theory. This algorithm makes the computations for training 
and classification simple, and it also performs rather well in 
practical applications of classification of text documents — 
see, for example [19]. It is employed via ifile, a filtering 
program developed by Jason Rennie at Carnegie Mellon 
University [38]. Messages are prioritized in ifile by giving the 
words on Subject and From lines higher weights in the 
computations. 
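
To make the idea concrete, the following is a minimal sketch of a 
text-based Naive Bayes classifier in which words from the Subject 
and From lines count more than body words. It is not the ifile 
implementation: the weight HEADER_WEIGHT, the dictionary layout of 
a message, the add-one smoothing, and all function names are 
assumptions chosen for illustration. 

```python
import math
from collections import Counter, defaultdict

HEADER_WEIGHT = 3  # assumed weight for Subject/From words (not from ifile)


def tokenize(message):
    """Return weighted word counts; header words count HEADER_WEIGHT times."""
    counts = Counter()
    for word in message.get("body", "").lower().split():
        counts[word] += 1
    for field in ("subject", "from"):
        for word in message.get(field, "").lower().split():
            counts[word] += HEADER_WEIGHT
    return counts


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> weighted count
        self.doc_counts = Counter()              # folder -> number of training messages

    def train(self, message, folder):
        self.word_counts[folder].update(tokenize(message))
        self.doc_counts[folder] += 1

    def classify(self, message):
        """Return the folder with the highest log posterior (Bayes' theorem)."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(message)
        best, best_score = None, -math.inf
        for folder, n_docs in self.doc_counts.items():
            counts = self.word_counts[folder]
            total = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for word, weight in tokens.items():
                # add-one smoothing keeps unseen words from zeroing the product
                p = (counts[word] + 1) / (total + len(vocab))
                score += weight * math.log(p)
            if score > best_score:
                best, best_score = folder, score
        return best
```

For example, after training on one message per folder, a new message 
whose From line matches an earlier Important message is pulled 
towards that folder mainly by the weighted header words. 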

The user can refile messages, either moving wrongly 
categorized messages into their right folders (folders that 
are available in the Busy mode) or saving messages for 
later action in the ToDo folder. The learning algorithm 
updates its parameters accordingly when the user refiles 
messages to any of the three main folders. However, the 
refiling of messages to the ToDo folder does not affect the 
algorithm. This is because teaching the system to file 
messages into the ToDo folder borders on the area of workflow 
and work procedures, which are outside the scope of our 
work. 
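
The refiling behaviour described above can be sketched as follows. 
This is a simplified illustration, not the prototype's code: the 
per-folder word counts stand in for the learner's parameters, and 
the names WordCounts and refile, as well as the decision to remove 
the words from the source folder's counts, are assumptions. 

```python
from collections import Counter, defaultdict

MAIN_FOLDERS = {"Important", "2nd Class", "Junk"}


class WordCounts:
    """Per-folder word counts, standing in for the learner's parameters."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def learn(self, words, folder):
        self.counts[folder].update(words)

    def unlearn(self, words, folder):
        self.counts[folder].subtract(words)
        self.counts[folder] += Counter()  # drops zero and negative entries


def refile(model, words, source, target):
    """Refile a message; update the model only for the three main folders.

    Returns True if the model was updated, False otherwise.
    """
    if target not in MAIN_FOLDERS:
        return False  # e.g. ToDo: the message moves, the model is untouched
    if source in MAIN_FOLDERS:
        model.unlearn(words, source)  # assumed: undo the earlier, wrong filing
    model.learn(words, target)
    return True
```

Refiling a wrongly categorized message from Important to Junk thus 
shifts its words between the two folders' counts, whereas moving the 
same message to ToDo leaves the counts as they were. 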

Changing to or from the Busy mode changes the folder 
display. The standard folders (and the Junk folder) are used 
in all modes and remain the same. The messages in the 
three main folders are automatically moved to the 
user-defined folders when the user switches to the Cool mode, 
using the standard filtering rules of the Cool mode. 
Messages already in Junk are not moved, however. 

The Cool mode. The folder display of the Cool mode 
(the top pane in fig. 3) shows the user-defined folders. 
These folders are used as targets for the user-defined rules that 
filter incoming messages. Messages that have not been filtered 
by the rules are left in the inbox and can be moved manually 
to their right folders by the user. 

The filter rules are defined by the user in a separate filter 
file, one rule per line, using a text editor. The syntax of the 
rules is [21] (pp. 374-383): 

field pattern action result string 

An example of a filter rule looks like this: 

from joe qpipe A "/x/y/rcvstore +JoeLetters" 

Here, the field argument is from and the pattern argument is 
joe, meaning that messages from joe will be acted upon. 
The result argument A means that if the field and pattern are 
matched, an action is performed. In this case, the action is 
to move the messages to the folder JoeLetters (defined in 
the string argument). The action argument qpipe is used to 
start a program. In this example, it starts the rcvstore 
program defined in the string argument, which performs the 
actual filing of the message. Since the result argument is A, 
the message is also marked "delivered", which means that it 
cannot match any more rules. Note that the categories in the 
Cool mode are created by the user and separate from those 
used in the Busy mode (see above). 
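
As an illustration of the five-argument rule format, the sketch 
below parses a rule line and finds the first rule that matches a 
message's headers. It does not reproduce the actual mail delivery 
software: the case-insensitive substring matching, the helper names 
(Rule, parse_rule, first_match, target_folder), and the reading of 
the folder name from the "+" word in the string argument are 
simplifying assumptions, and no program is started. 

```python
import shlex
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    field: str
    pattern: str
    action: str
    result: str
    string: str


def parse_rule(line: str) -> Rule:
    """Split one rule line into its five arguments (quotes group the string)."""
    parts = shlex.split(line)
    if len(parts) != 5:
        raise ValueError(f"expected 5 arguments, got {len(parts)}: {line!r}")
    return Rule(*parts)


def first_match(rules, headers) -> Optional[Rule]:
    """Return the first rule whose pattern occurs in the named header field."""
    for rule in rules:
        value = headers.get(rule.field.lower(), "")
        if rule.pattern.lower() in value.lower():
            return rule
    return None


def target_folder(rule: Rule) -> Optional[str]:
    """Extract the folder name from a '+Folder' word in the string argument."""
    for word in rule.string.split():
        if word.startswith("+"):
            return word[1:]
    return None
```

With the example rule, a message whose From line contains "joe" 
matches and target_folder yields JoeLetters; a message that matches 
no rule would stay in the inbox. 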

The Curious mode. Matching the user's need with 
documents in a collection is a challenge in any IR system. 
The Curious mode is designed to meet the challenge of "the 
anomalous state of knowledge", at least to some extent. 

The Curious mode uses its own window for the display 
and selection of groupings of messages (fig. 4). Each 
grouping is shown in a scroll window of its own. A 
summary of each grouping is displayed in the header of 
each scroll window, consisting of the grouping number, the 
number of messages in it, and the ten most common words 
in the grouping. 

Figure 4. An example of the results of Scatter/Gather 
in the Curious mode. 

To make use of this mode, the user 
typically selects a set of folders when she is in the Cool 
mode. The folders of the Busy mode can also be employed. 
The selection is done via a combination of keys that is 
consistent with the way Exmh is used. Thereafter, the user 
changes the mode to the Curious mode via the Mode button 
in Exmh (fig. 3), opening a separate window on the screen. 

The messages are grouped into new categories based on 
groupings (clusters) that are created by a variant of the 
Scatter/Gather algorithm [4]. The algorithm uses a 
non-hierarchical partitioning strategy to cluster n documents 
into k groups. A strategy called Buckshot [4] is used to find 
initial centres for the clusters. Buckshot is 
non-deterministic, i.e., different (random) centres are output 
each time the same document set is given. The centres are 
used as starting points in the clustering algorithm that is 
employed to organize a set of documents into a given 
number of topic-coherent groups. We use Ward's method, 
a hierarchical agglomerative clustering method [9]. It uses 
the minimum variance measure to calculate "closeness" 
between points (documents). Though it is sensitive to 
outliers (documents far from the cluster centres), Ward's 
method produces compact groups of well distributed size 
and is deemed appropriate for our domain. The inputs to 
the clustering algorithm are a pairwise similarity measure 
and the number of desired clusters. We use Dice's 
coefficient, since the documents are short and execution 
time is critical [26] [9]. The number of desired clusters can 
be set by the user via the Preferences window in Exmh (the 
default is 5). The assignment of documents to cluster 
centres is only done twice, since the assignment process 
makes its greatest gains in the first few steps [4]. The 
second time, new cluster centres are computed using the m 
most central documents in each group. We use the 70% of 
the documents that are "closest" according to the minimum 
variance measure used in Ward's method. Since the 
Scatter/Gather algorithm is interactive, Buckshot is 
optimized for speed rather than accuracy (i.e., the rate of 
misclassification). 
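
The two assignment passes can be illustrated with a deliberately 
simplified sketch. It shows Dice's coefficient on word sets and the 
assign, recentre, assign sequence only; it does not implement 
Buckshot or Ward's method. The initial centres are passed in, 
"closeness" is measured with Dice's coefficient rather than the 
minimum variance measure, and a centre is represented as the union 
of the words of its most central documents. All of these choices, 
and the function names, are assumptions made for the example. 

```python
def dice(a, b):
    """Dice's coefficient between two word sets: 2|A n B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def assign(docs, centres):
    """Put each document in the group of its most similar centre."""
    groups = [[] for _ in centres]
    for doc in docs:
        best = max(range(len(centres)), key=lambda i: dice(doc, centres[i]))
        groups[best].append(doc)
    return groups


def cluster(docs, initial_centres, keep=0.7):
    """Assign, recompute centres from the most central documents, assign again."""
    centres = list(initial_centres)
    groups = assign(docs, centres)  # first pass
    new_centres = []
    for centre, group in zip(centres, groups):
        if not group:
            new_centres.append(centre)  # nothing assigned: keep the old centre
            continue
        ranked = sorted(group, key=lambda d: dice(d, centre), reverse=True)
        m = max(1, round(keep * len(ranked)))  # the m most central documents
        new_centres.append(frozenset().union(*ranked[:m]))
    return assign(docs, new_centres)  # second and final pass
```

Documents are given as frozensets of words. Stopping after the second 
pass trades some accuracy for a shorter response time, in line with 
the interactive use described above. 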

4.1 A worked example 

Suppose the user has just arrived at her computer and starts 
her e-mail client (typically by clicking on an icon). 
Furthermore, suppose she is in a hurry, so she wants to see 
all important messages among all unseen and new 
messages. Thus, she changes the mode to Busy (the Cool 
mode is the default when the e-mail client is started) by 
selecting the mode from the menu under the Mode button. 
Now, the important messages are made available in a 
separate folder named Important (fig. 3). After doing some 
quick reading the user refiles a couple of messages into the 
ToDo folder, some other messages into the Junk folder, and 
another couple of messages into the 2nd Class folder. The 
user then exits the e-mail client, since she has skimmed 
through her new and unseen e-mail and is in a hurry to get 
elsewhere. Note that the filter rules of the Cool mode continue 
to work in the background and sort incoming messages into 
the user-defined folders available in the Cool mode. 

Suppose the user comes back, now with more time on 
her hands. Let us say that she is interested in examining the 
messages from a mailing list called VOCALIST [41] that 
she has stored in the folder with the same name. The 
messages have previously been routed to the folder by the 
user-defined rules in the Cool mode. The first action that 
she takes is to mark the VOCALIST folder — she could also 
have continued to select other folders by using the same 
marking procedure. She then changes to the Curious mode 
via the menu under the Mode button (fig. 3). 

A separate window for the Curious mode appears, with 
a message asking the user to wait while the system creates 
groupings out of the selected folder (or folders) of 
messages. After a while, the result is shown (fig. 4). Each 
grouping is shown in a scroll window of its own. A 
summary of each grouping is displayed in the header of 
each scroll window, consisting of the grouping number, the 
number of messages in it, and the ten most common words 
in the grouping. Let us say that the user is especially 
interested in "voice types". She selects the groups with 
summaries containing the words "voice" and "type" (the 
first two groups in fig. 4) by clicking on the button in the 
header of the scroll windows. She then clicks the Scatter 
button to see new groupings of the newly selected 
messages. In this way, the user iteratively refines the search 
for interesting messages. When the user has satisfied her 
information needs, she has the option to save the groupings 
as new folders, before she quits the Curious mode by 
dismissing the window. 
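
The summary shown in the header of each scroll window (grouping 
number, number of messages, and the most common words) could be 
computed along the following lines. This is only a sketch: the 
function name summarize, the plain whitespace tokenization, and the 
layout of the header string are assumptions, not the prototype's 
actual output format. 

```python
from collections import Counter


def summarize(number, messages, top=10):
    """Build a header line for one grouping of message texts."""
    words = Counter()
    for text in messages:
        words.update(text.lower().split())
    common = [word for word, _ in words.most_common(top)]
    return f"Group {number}: {len(messages)} messages: " + " ".join(common)
```

A practical version would probably skip very frequent function words 
("the", "and") before counting, so that the summary words say 
something about the topic of the grouping. 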

5. Discussion and conclusion 

It is clear that the capability to manage a heavy e-mail load is 
rapidly moving from an extra feature to something that is 
absolutely mandatory. 

By examining individuals' categorization processes and 
organization of messages on the computer screen, we were 
able to extract a number of interesting concepts and ideas 
for both an interface and a new conceptual model for 
handling e-mail messages. The messages can be viewed as 
either a continuous stream of messages or a stored 
collection of messages. The conceptual model, a 
Categorization Assistant For E-mail (CAFE), consists of 
three modes: the Busy mode, the Cool mode, and the 
Curious mode. Each mode treats the messages in different 
ways. Each mode is also used in a different situation, 
depending on the user's "state of mind" and the amount of 
time that she has available. 

With CAFE, the filtering functions of the e-mail client 
can be personalized. That is, the sorting of messages into 
folders (categories) can be done in more than one way. The 
Cool mode gives the user full control of simple filtering 
rules. Typically, the messages are sorted into categories that 
are topic-oriented or sender-oriented, i.e., based on the 
Subject or From lines of messages. More advanced rules 
can be derived via the machine learning algorithm in the 
Busy mode. The algorithm complements the filtering rules 
in the Cool mode. With the Scatter/Gather algorithm in the 
Curious mode the user can first seek broadly relevant 
information and then browse to reach the goal. Here, the 
user can make queries that she cannot even state, simply by 
selecting groups instead of individual queries. Apart from 
the explorative possibilities, a certain level of serendipity 
can also be achieved via the Curious mode. 

As Marchionini [16] (p. 44) points out, the cost of 
flexible representations of information is in the various 
mechanisms for controlling the different representations. 
The mechanisms — usually paging, scrolling, and 
jumping — require the user to develop new strategies for 
manipulating the physical structure of the information, e.g., 
the length of a message or multiple windows on the screen 
[16]. In CAFE, in this regard, we have not introduced any 
new mechanisms not available in the e-mail client Exmh 
before. For example, the folders are still represented in the 
same way, i.e., as collections of browsable message 
summaries in scroll windows. 

Our experience suggests that in general, for the user to 
be able to formulate her information need, a successful 
implementation should make it possible for the user to use 
her experience and expertise. 

Browsing is a central strategy in accessing information. 
In a terminology borrowed from Marchionini [16] this 
strategy can be supported using either probes, filters, or 
templates. In our prototype: 

• the "probes" are represented by the different search 
functions, such as Scatter/Gather in the Curious mode¹ 

• the "filters" are represented by the filtering rules in the 
Cool mode 

• the "templates" are represented by the predefined 
folders in the Busy mode². 

The implementation of the hierarchical clustering 
algorithm (Ward's method) in the Curious mode is 
currently too slow. Also, two documents with the same 
content, but written in different languages, are not treated as 
similar documents, since similarity is based on keywords, 
which is a drawback of the simple techniques chosen. 
Furthermore, large clusters should be split into two 
clusters. 

In conclusion, the locus of control is still close to the 
user in CAFE, who gets a handful of new and usable 
possibilities for handling her e-mail. Furthermore, we 
alleviate some of the cognitive demand on the user in 
refining her "anomalous state of knowledge". Finally, the 
different modes improve the possibilities to personalize 
the information management in e-mail. 



1. Furthermore, Exmh has Glimpse [37] as a built-in search 
engine. 

2. In addition, Exmh uses, among other things, the 
components file for creating templates [21]. 

6. Future work 

The conceptual model can be extended in several ways: 
more personalized modes resembling user profiles [18] can 
be added, and the data in the address book, calendar, and 
other "add-ons" associated with the e-mail client can also 
be included in the model. Examples of add-ons are addressing 
through aliases, adding message signatures, supporting 
"advanced" text formatting, and spell checking. 
Information in other domains, such as netnews messages 
and personal document collections could also be managed. 
Fleming and Kilgour [8] have described an approach to 
restructuring the domain of e-mail, deriving message 
prototypes (templates) directly from users' formal or 
informal message structures. Incorporating these ideas, 
which relate to visual programming, can make the 
conceptual model even more flexible. For example, this 
could make searches based on message structures [15] such 
as "review form" and "meeting announcement" possible. 

One future direction for our work is generalizing it to 
other kinds of information and, also, scaling it up for larger 
volumes of unrestricted text. The Scatter/Gather algorithm 
was originally designed for large document databases: 
30 MB of ASCII text in about 5000 New York Times News 
Service articles [4]. 

The Curious mode can be applied to the results of a 
search with Glimpse [37] in Exmh, thus enabling the 
user to view the search results in another way [10]. An 
important part is the definition and handling of the rules in 
the Cool mode, which really should be done via a special 
user interface [29]. However, we let the user define and edit 
the rules in an ordinary text file in the current 
implementation. 

The first concrete goal is to optimize the execution of the 
algorithms in the prototype and make an evaluation of the 
prototype with real users. Exmh is used by other persons at 
our department, which opens up the possibility to make an 
evaluation of CAFE in a real environment. 

There are many optimizations that can be done 
concerning the execution of the algorithm and the language 
that it is implemented in (Perl [34]), including changing the 
language completely for substantial efficiency savings. The 
initial cluster centres in the algorithm might be selected 
based on how dissimilar they are, e.g., a similarity measure 
of less than 0.05, instead of a random selection. We are 
considering making the prototype available on the Internet 
for Exmh users. 
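
The seed-selection idea mentioned above can be sketched as 
follows. This is only an illustration in Python (the 
prototype itself is written in Perl): the cosine measure 
over keyword counts, the function names, and the toy 
messages are assumptions, and the prototype's actual 
similarity measure may differ. 

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two keyword-count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_dissimilar_seeds(vectors, k, threshold=0.05):
    """Greedily pick up to k seed indices whose pairwise similarity
    stays below the threshold (hypothetical helper, not from CAFE)."""
    seeds = []
    for i, vec in enumerate(vectors):
        # Accept a candidate only if it is dissimilar to every seed so far.
        if all(cosine_similarity(vec, vectors[j]) < threshold for j in seeds):
            seeds.append(i)
        if len(seeds) == k:
            break
    return seeds


# Toy message summaries standing in for real e-mail.
messages = [
    "meeting agenda room schedule",
    "meeting schedule room booking",
    "perl script bug patch",
    "holiday travel flight hotel",
]
vectors = [Counter(m.split()) for m in messages]
seeds = pick_dissimilar_seeds(vectors, k=3)
```

Note that the greedy pass may return fewer than k seeds when 
the collection contains few mutually dissimilar messages; a 
fallback to random selection would then still be needed. 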

Abstract 

Fielding secure computer systems requires tradeoffs 
between functionality, flexibility, and security to meet the 
users' needs. Multilevel secure (MLS) computer systems 
provide better control over classified information than 
traditional systems and allow users from a diverse 
population access to information they need while 
protecting sensitive data. Users want the functionality of 
non-MLS computer systems: graphical user interfaces, a 
rich assortment of software, and electronic connectivity 
with other systems. Compartmented mode workstations 
(CMW) can provide such an environment. An overview 
of secure system architectures and an example MLS 
network provide the framework for discussing the risks 
associated with interconnecting MLS systems and 
unclassified networks, and approaches for mitigating 
those risks. A secure Email gateway, using a 
high-assurance (A1) network component, provides the 
necessary safeguards for protecting the MLS network 
from external attacks. 

1: System-High, Parallel, or MLS 

Separation of information based upon classification 
has traditionally been accomplished by assigning a 
security classification to the information, creating a 
physically secure perimeter around the information, and 
requiring appropriate security clearances of all persons 
needing access to this information. Such "system-high" 
environments ignore the different classifications of the 
data by requiring that all data protected by the system be 
treated as though it is classified at the highest level 
authorized for the system. Only a person with a 
clearance that dominates (is the same or higher than) the 
security classification of the most sensitive information 
protected by the perimeter is allowed to enter. Once 
inside the perimeter, access is granted only to the specific 
information that the person needs. Electronic data 
processing systems (i.e., computers) used within a 
system-high computer network traditionally do not 
provide the mechanisms or the assurances necessary to 
identify the security level of each piece of its information 
or to guarantee that information at a lower classification 
cannot become contaminated by information of a higher 
classification. As a result, all information imported into 
a system-high computer must be treated as though it 
contained information at the system's highest level. In 
Figure 1, unauthorized users are not allowed into the 
security perimeter and unreviewed data is not allowed 
out. 

System-high environments work well if all people 
needing access to the data are already cleared at the 
highest level of the system and if the data is not exported 
to a system with a lower classification. The procedures 
used for validating the security level of information 
stored in a system-high computer when the information 
is exported are both time-consuming and unreliable, and 
are unnecessary if the computer system is trusted to 
maintain the identification and separation of the 
classified information. System-high computer systems 
also ignore the real needs of users by requiring security 
clearances (at considerable expense), even for those who 
will never need access to classified data. 

Figure 1. System-High Environment 

Some drawbacks associated with system-high 
environments are addressed by using multiple, 
single-level computer systems as shown in Figure 2. In this 
environment, a Secret, system-high computer network 
and an unclassified computer network coexist so that the 
users are assured that the unclassified information (stored 
on the unclassified network) always remains unclassified. 
Computers and terminals connected to the unclassified 
network may be located in unsecured areas and accessed 
by uncleared persons without fear of compromising the 
Secret computer system. Problems arise when a user 
needs to incorporate information from both systems into 
a single document. It may be necessary to include 
unclassified information from a Secret document in an 
unclassified report, or to include unclassified information 
from an unclassified document in a Secret report. The 
procedures for moving data from one system to another 
are not convenient and may require re-entering the data 
manually. Fielding multiple single-level networks also 
has its drawbacks due to the expense of the duplicate 
equipment. This scheme is unattractive to users who 
must access data stored on both networks because they 
must switch between multiple terminals to accomplish 
their job. These drawbacks increase as more types of 
classified data (e.g., Top Secret or special access) are 
introduced into the environment. 

The solution resides in building computer systems 
with sufficient features and trust to appropriately label all 
information with the appropriate security classification 
and keep the information separated while allowing access 
only to those persons authorized for the data. The 
National Computer Security Center (NCSC) published 
the Trusted Computer System Evaluation Criteria 
(TCSEC) [1] to categorize computer systems based on 
the features and assurances they provide to protect 
sensitive data. The NCSC evaluates computer products 
against the criteria contained in the TCSEC and 
publishes its results in its Evaluated Products List 
(EPL). A series of related documents provide guidelines 
for applying the standards of the TCSEC to computer 
products. MLS systems are evaluated against the TCSEC 
criteria for B1, B2, B3, or A1 ratings. Computer 
networking components are also evaluated using the 
guidance of the Trusted Network Interpretation (TNI) 
[2]. The functional differences between the lowest MLS 
rating of B1 and the highest rating of A1 are minimal; 
however, the probability that security flaws exist in the 
system decreases dramatically as the system rating goes 
from B1 to A1. Most of the differences between B1 and 
A1 systems are found in the way they are designed, 
architected, analyzed, and tested. The higher the 
evaluation rating, the better the system is at assuring that 
the security policy enforced by the system cannot be 
compromised. 

The parallel network configuration from Figure 2 can 
be re-architected by using MLS technology to provide the 
appropriate data to authorized users in approved 
locations without the problems of duplicated equipment 
or manual downgrade between systems. This redesigned 
MLS configuration is shown in Figure 3. Each MLS 
computer is configured to protect an appropriate range of 
data and each interface into the computer is likewise 
assigned a security range. In this environment, an 
uncleared user outside of the physically-secured Secret 
and Confidential facilities is still able to access 
unclassified information from all three computer systems 
shown without being able to access any classified 
information. Users within the Confidential security 
perimeter can likewise access any unclassified 
information, but are also able to get to any Confidential 
data in their own facility or the Secret facility, if 
necessary. Users in the Secret facility have access to all 
of the information stored within the MLS computer 
system. 
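
The "dominates" relation and the per-interface security 
ranges described above can be illustrated with a small 
Python sketch. It is a simplification under stated 
assumptions: only hierarchical levels are modeled (real MLS 
labels also carry categories or compartments), and the names 
Level, dominates, and readable_levels are hypothetical, not 
taken from any evaluated product. 

```python
from enum import IntEnum


class Level(IntEnum):
    """Hierarchical classification levels only (assumed ordering)."""
    UNCLASSIFIED = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3


def dominates(clearance, classification):
    """A clearance dominates a classification if it is the same or higher."""
    return clearance >= classification


def readable_levels(user_clearance, interface_max, system_range):
    """Levels a user can read through an interface of an MLS computer.

    The effective ceiling is the lowest of the user's clearance, the
    top of the interface's range, and the top of the system's range.
    """
    low, high = system_range
    ceiling = min(user_clearance, interface_max, high)
    return [lvl for lvl in Level if low <= lvl <= ceiling]


system = (Level.UNCLASSIFIED, Level.SECRET)

# Uncleared user outside the secured facilities.
outside = readable_levels(Level.UNCLASSIFIED, Level.UNCLASSIFIED, system)
# Confidential-cleared user inside the Confidential perimeter.
confidential = readable_levels(Level.CONFIDENTIAL, Level.CONFIDENTIAL, system)
# Secret-cleared user inside the Secret facility.
secret = readable_levels(Level.SECRET, Level.SECRET, system)
```

The three calls mirror the three user populations in the 
redesigned configuration: each sees every level up to, but 
not beyond, what both the clearance and the interface allow. 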

2: Problem Definition 

Users within the Department of Defense (DoD) want 
the same capabilities found in today's commercial, 
off-the-shelf (COTS) products but must have access to both 
unclassified and classified information. Few 
high-assurance components (B3 or A1) exist today and these 
do not yet support the rich suite of applications available 
for non-MLS computer systems (e.g., graphical user 
interfaces and integrated office automation software). 
CMWs provide users with a graphical user interface 
(trusted X windows), a widely-used operating system 
(UNIX) rich with COTS applications, and the ability to 
work within an MLS environment (at the B1 level of 
assurance). We look at one example of a very large MLS 
network to examine the issues related to MLS computer 
systems. 

The United States Army and National Guard units are 
currently undergoing a dramatic change in the way they 
perform their duties [3]. The Reserve Component 
Automation System (RCAS) consists of computer 
hardware and software being fielded to provide a 
common, integrated suite of office automation tools 
electronically linking all of the units. Additionally, a 
number of RCAS applications are being developed to 
electronically replicate the paper forms system currently 
in use [4]. RCAS must interconnect military computer 
systems at over five thousand sites scattered throughout 
the United States. These sites are connected with 
end-to-end encryption over both dedicated and dial-up lines and 
rely upon electronic mail as the primary communications 
service between sites. 

Like many DoD computer systems, RCAS contains 
mostly unclassified information; however, some 
classified data must also be accessible to those with both 
the necessary clearance and need for the data. Running 
RCAS as a Secret, system-high network was ruled out 
because it would require extensive site modifications and 
Secret clearances for all of the Army Reserve and 
National Guard personnel. A parallel network 
architecture was also considered, but RCAS was designed 
and fielded as a multilevel computer system in order to 
provide the required functionality, notably a user must be 
able to perform all functions from a single point of entry. 

As an MLS network, RCAS was required to comply 
with government regulations specifying the 
trustworthiness of computer components when selecting the 
equipment. The NCSC publishes the Computer Security 
Requirements, Guidance for Applying the Department of 
Defense Trusted Computer System Evaluation Criteria in 
Specific Environments [5], or the "Yellow Book", for use 
in determining the appropriate ratings (B1 through A1) 
needed by computer components used in MLS systems. 
Table 1, below, presents a concise summary of the 
Yellow Book's guidance based upon the highest 
classification of data protected by the system and the 
lowest clearance level of a potential user of that system. 
The rows in the table represent the minimum clearance 
needed by a user of the system. The columns represent 
the highest classification of data protected by the system. 
The intersection of a row and a column identifies the 
minimum trust (TCSEC category) required for the 
system. It is clear from this table that a rating of B1 or 
higher is required whenever the minimum user clearance 
is less than the highest data classification; however, there 
are few cases where a B1 system is sufficient. There are 
also environments identified by the Yellow Book where 
the NCSC believes that even the protection of an A1 
system is inadequate. 

RCAS meets the Yellow Book's guidelines by using a 
B1 computer system, accessible by personnel with at least 
a Confidential clearance, to protect data classified at no 
higher than Secret. Many DoD computer systems have 
similar user and data requirements and could benefit 
from the advantages of a CMW architecture. But what 
happens when you must exchange data with another 
computer system where the potential users of one system 
do not meet the clearance requirements of the other 
system? Some additional procedures, controls, or 
security assurances are needed. 

3: Analysis of the Problem 

People served by today's computer networks do not 
exist in isolation. They must be able to communicate with 
other computing systems. Before the introduction of 
MLS systems, the only secure way to interconnect 
systems at different security levels was through an "air 
gap" between the systems - copy the information off one 
system and then onto the other system. Some systems 
were configured to provide a "one-way" write-up from 
the low system to the high system while blocking any 
transmission going in the other direction. Others 
required a person in the loop to manually review all 
information as it passed through the data exchange point. 
MLS computer systems provide the capability to directly 
interconnect systems at different security levels provided 
the MLS system architecture satisfies the requirements of 
both systems. If the MLS computer system does not 
provide enough assurance, as in Figure 4, then an 
additional device may be needed between the computer 
systems to provide additional protection. 

As shown above, B1 CMWs do not provide the 
assurance necessary to reliably separate Secret data from 
uncleared, potentially unknown users. Given an MLS 
network with all users cleared at least to the Confidential 
level, and a C2, unclassified network, it is unlikely that a 
direct electronic connection will meet the security needs 
of either system. The MLS network must have additional 
protection from the uncleared users of the C2 network. It 
is necessary to analyze the security risks involved before 
an acceptable solution to this interfacing problem can be 
identified. The following risks were identified as key 
concerns in the case of the RCAS network: 

1. unauthorized access to the MLS network, 

2. accidental disclosure of classified information 
from the MLS network, and 

3. malicious software imported into the MLS network. 

Access to the RCAS system is controlled by the 
identification and authorization mechanisms of the 
CMWs. These mechanisms are implemented by 
validating a user's name and password before access is 
granted. Because the RCAS user group is restricted by 
physical means to a closed set of known users, additional 
measures (e.g., smart cards or scanners) are not required. 
Adding an external interface to the MLS network presents 
the problem of unknown, uncleared persons attempting to 
gain access by guessing user account names and 
passwords. There have been many highly-publicized 
examples of successful attacks on interconnected 
computer systems. In one example [6], Bill 
Cheswick describes watching and trapping an intruder 
who believes he has accessed a classified military 
computer by successfully guessing username/password 
combinations. Bill concludes that "if a hacker obtains a 
login on a machine, there is a good chance he can 
become root sooner or later. There are many buggy 
programs that run at high privileged levels that offer 
opportunities for a cracker. If he gets a login on your 
computer you are in trouble." This form of attack must 
be prevented with a high degree of assurance if 
unauthorized (or unknown) people can attempt to gain 
access into the MLS computer system. 

Accidental disclosure of classified information can be 
caused by careless users or by flaws in the security 
mechanisms of the computer systems. These problems 
become worse when an attacker enters the system and is 
then able to exploit the real user or security flaws to trick 
the system into disclosing additional unauthorized 
information. An unclassified external interface 
introduces the possibility of Secret data falling into the 
hands of uncleared personnel. The external interface 
must not only prevent the attacker from getting inside the 
MLS network, but must also minimize and control the 
paths that could result in classified information leaking 
out to the external system. Any paths that cannot be 
closed must be identified and their risks understood. 

Virus and Trojan horse software compromise the 
integrity and security of computer systems. The concern 
to RCAS is that an unknown person may attempt to 
insert this type of software into RCAS via an external 
system interface. Many paths are available in UNIX 
computer systems for installing executable code. Some 
methods require accessing an existing account on the 
computer, while others rely on system services or flaws to 
create openings. Cliff Stoll details many techniques used 
by the wily hacker [7]. Not all attacks come directly 
from an attacker. A program released by an attacker can 
work its way through the maze of computers looking for 
weaknesses. One technique used by the Internet worm [8] 
involved exploiting a debug "feature" of the sendmail 
program to gain information about neighboring 
computers. This type of threat must be addressed by any 
interface into the secure computer system. 

4: Searching for a Secure Solution 

A team of experts was called to brainstorm a set of 
desires expressed by the RCAS user community, not the 
least of which was Email connectivity with non-RCAS 
computer systems. Three approaches were identified by 
the team to solving this type of interface problem: 

1. Upgrade the CMW network to provide higher (B2 
or above) assurance, 

2. Interconnect the CMW network to external systems 
with B1-assured gateways, 

3. Interconnect the CMW network to external systems 
with higher (B2 or above) assurance gateways. 

The first approach was the preferred solution. 
Unclassified computer systems with uncleared (but 
authorized) users could be directly connected to the 
CMW network while maintaining the security assurances 
required by the Yellow Book. There were, however, 
three problems with this approach. First, there were no 
COTS products available providing the required 
functionality (able to run an integrated office automation 
suite of software, GOSIP-compliant, etc.) evaluated by 
the NCSC at the B2 level of assurance. Second, most of 
the currently evaluated MLS products did not include the 
networking protocol stack (TCP/IP or TP4/CLNP) as 
part of their evaluated product. Third, users of an 
external computer system may be unauthorized, 
increasing the requirement to at least the B3 level of 
assurance. 

The second approach involved configuring an existing 
RCAS CMW as an Email gateway (see Figure 5). This 
gateway would disable all non-Email services and 
provide selective filtering of Email messages based upon 
configurable parameters. Unauthorized access to the 
gateway would be controlled by not allowing user 
accounts and by disabling any interactive daemons that 
an attacker could exploit. Accidental disclosure of 
classified information would be handled by reducing the 
number of paths (services) out of the CMW network 
(only SMTP service). Introduction of malicious software 
into the CMW network would be controlled by scanning 
Email message header fields for specific characteristics 
(e.g., messages to pipes or files) and by disabling services 
that an attacker would normally use to insert software 
(e.g., TELNET, FTP). 
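
The header scanning described above (rejecting messages 
addressed to pipes or files) can be sketched in Python as 
follows. This is an illustration only, not the RCAS gateway 
code. It assumes the sendmail convention that an address 
beginning with "|" names a program and one beginning with 
"/" names a file; the header list and the function names are 
hypothetical. 

```python
from email import message_from_string

# Address-bearing header fields to inspect (an assumed, non-exhaustive list).
ADDRESS_HEADERS = ("From", "Sender", "Reply-To", "Return-Path",
                   "To", "Cc", "Bcc", "Errors-To")


def clean(address):
    """Strip whitespace, quotes, and angle brackets around an address."""
    return address.strip().strip("<>\"'").strip()


def suspicious_addresses(raw_message):
    """Return (header, address) pairs that look like pipes or files."""
    msg = message_from_string(raw_message)
    flagged = []
    for header in ADDRESS_HEADERS:
        for value in msg.get_all(header, []):
            # Naive comma split: adequate for a sketch, but it would
            # mis-split display names that themselves contain commas.
            for part in value.split(","):
                address = clean(part)
                if address.startswith("|") or address.startswith("/"):
                    flagged.append((header, address))
    return flagged


hostile = (
    "From: someone@example.org\n"
    'To: "|/bin/sh", user@rcas.example\n'
    "Subject: test\n"
    "\n"
    "body\n"
)
benign = (
    "From: someone@example.org\n"
    "To: user@rcas.example\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
hostile_hits = suspicious_addresses(hostile)
benign_hits = suspicious_addresses(benign)
```

A gateway built along these lines would refuse to forward 
any message for which the returned list is non-empty. 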

This approach provides the necessary functionality, 
but concerns were raised about the trustworthiness of a 
B1 solution. Restricting a B1 computer to a minimal 
set of processes needed to implement the gateway does 
not increase the B1 computer's security assurance to the 
same level as a B2 or B3 platform. It was believed that 
attempts to penetrate the CMW network by bypassing the 
security mechanisms of the gateway, or attempts to 
compromise the gateway itself and then use it to attack 
the CMW network, had a high probability of success. 

The third approach involved configuring a COTS B2 
platform as the Email gateway. This solution has all of 
the advantages of the B1 solution along with some 
additional assurance that the gateway would resist more 
aggressive attacks. One drawback to this solution is the 
small number of evaluated B2 (or higher) products 
available that could be used for the gateway platform. 
An analysis of the available platforms was needed, along 
with the effort required to create the gateway. Additional 
information was needed about how the gateway would 
block the identified threats, and how any residual risks 
could be mitigated. 

5: Problem Revisited, a Candidate is Found 

Building an Email gateway using a general-purpose 
MLS host computer looked like a straightforward task, 
but a closer look provided some useful insight. B1 
computer products differ from C2 (non-MLS) computer 
products primarily in the functionality provided to 
support mandatory access controls. Security labels, user 
and device access ranges, auditing, and the concept of a 
Trusted Computing Base (TCB) comprise the major 
changes to the system. What is provided by B2 through 
A1 products but lacking in B1 products are the 
assurances (structuring and minimizing the TCB to 
support more rigorous architectural analysis and testing, 
formal specifications, code-to-specification mappings, 
even informal or formal proofs of correctness) that if the 
component starts in a secure state, it will remain secure. 

Without these added assurances, it is likely that the 
same flaws found in today's general-purpose computers 
also exist in the current B1 products. The very nature of 
a general-purpose computer implies that its configuration 
is easily modifiable, providing numerous ways an 
attacker can change the state of the computer. The large 
size of the TCB software in a B1 computer makes it 
difficult (if not impossible) to identify and understand all 
of the possible vulnerabilities available to an attacker. 
Building a secure Email gateway with this type of 
architecture leaves many potential security holes 
unrelated to the gateway application itself. 

We were trying to maintain the security of the system 
while satisfying the desires of the users by developing a 
high-assurance application using a low-assurance, 
reprogrammable MLS product with an unevaluated 
protocol stack and then trying to convince ourselves it 
was trustworthy enough. The ideal solution was to 
integrate a dedicated high-assurance gateway application 
into a high-assurance, non-reprogrammable, trusted MLS 
protocol stack. The A1-evaluated Boeing Multilevel 
Secure Local Area Network (MLS LAN) proved to be an 
ideal host for developing the Email gateway. 

The MLS LAN is a data communications network 
system component determined by the National Security 
Agency to fulfill the A1 requirements for mandatory 
access control, discretionary access control, identification 
and authentication, and audit, as defined in Appendix A 
of the TNI. The MLS LAN provides access-controlled 
communications between attached devices where these 
devices may be operating at different sensitivity levels. 
The MLS LAN is comprised of multiple Secure Network 
Servers (SNS), a transmission medium, and an SNS 
configured as a Network Management node (Figure 6). 
The A1-evaluated MLS LAN product provides many 
additional capabilities not discussed in this paper. For a 
complete description of the product, the interested reader 
should obtain a copy of the NCSC Final Evaluation 
Report [9]. 

The basic unit of the MLS LAN is the SNS, which is 
expandable to support various configurations of 
subscriber devices. The core SNS is comprised of a 
chassis with shared memory board and an SNS processor 
(SNSP) providing the Ethernet trunk interface. To 
achieve a desired level of functionality, subscriber device 
interface cards are added to the core SNS. A 
combination of hosts, video devices, and/or terminals 
can be connected to SNSs in various configurations. Of 
specific interest for the RCAS gateway is the IP interface 
card. This configuration provides static routing of IP 
datagrams from host to network or from network to 
network (Figure 7). 

Hosts or remote networks that use the IP protocol may 
be connected to an SNS over an interface. IP interfaces 
may be configured as a host or router interface providing 
standard IP service. They can be configured as either 
multilevel or single-level (labeled or unlabeled) devices. 
Security labeling over the IP interface supports the 
Commercial IP Security Option (CIPSO) [10] protocol. The 
security label of every datagram is checked as it enters 
the SNS. Security labels are removed from datagrams 
before they are sent to unlabeled interfaces, and are 
added when they are received from unlabeled interfaces. 

The security model of the MLS LAN architecture 
relies upon the separation of processes. Inter-process 
communications, task scheduling, and hardware interface 
accesses are controlled by the security kernel and the 
processor's hardware protection mechanisms. Processes 
outside of the network trusted computing base (NTCB) 
are further restricted to a subset of the kernel services. 
All processes are pre-defined with their code, the initial 
value of the data and special segments, and all initial 
memory requirements known at system build time. 
Dynamic process creation is handled by a template 
mechanism in which only a common code segment and 
constant, read-only data are shared among processes. 
Separate read/write data, stack, and special system 
segments are allocated and initialized from the template 
for each dynamically created process. There is no 
mechanism for modifying or adding code to the MLS 
LAN. Figure 8 shows how the MLS LAN software is 
partitioned on a typical device interface processor. 
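
The label handling at SNS interfaces described in this 
section (labels checked on entry, removed before delivery to 
unlabeled interfaces, and added on receipt from unlabeled 
interfaces) can be pictured with the following Python sketch. 
It is a toy model, not the MLS LAN implementation: labels 
are plain integers rather than CIPSO options, the range 
check on output is an assumption, and all names are 
hypothetical. 

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Interface:
    name: str
    labeled: bool  # True if datagrams on this interface carry labels
    low: int       # lowest level permitted on this interface
    high: int      # highest level (low == high for single-level)


@dataclass(frozen=True)
class Datagram:
    payload: bytes
    label: Optional[int] = None


def in_range(label, iface):
    return label is not None and iface.low <= label <= iface.high


def receive(dgram, iface):
    """Entry into the SNS: check the label, or add one if unlabeled."""
    if iface.labeled:
        if not in_range(dgram.label, iface):
            raise PermissionError("label outside range of " + iface.name)
        return dgram
    # Unlabeled interface: assign the interface's single level.
    return replace(dgram, label=iface.low)


def send(dgram, iface):
    """Exit from the SNS: strip the label for unlabeled interfaces."""
    if not in_range(dgram.label, iface):  # assumed output check
        raise PermissionError("label outside range of " + iface.name)
    if iface.labeled:
        return dgram
    return replace(dgram, label=None)


single = Interface("single-level host", labeled=False, low=1, high=1)
multi = Interface("multilevel host", labeled=True, low=0, high=2)

inside = receive(Datagram(b"mail"), single)  # label added on entry
to_multi = send(inside, multi)               # label kept
to_single = send(inside, single)             # label removed
```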

An MLS LAN must include one SNS designated as 
the Network Management (NM) node. This node 
includes network management software that maintains 
the network configuration database, supports dynamic 
network reconfiguration, collects audit records, supports 
network troubleshooting, provides administrator console 
interfaces, and allows tape backup of configuration and 
audit information. All network administration functions 
are supported by NM-attached consoles and are not 
accessible via network-attached hosts or terminals. 

Each SNS also includes one card that handles the 
configuration, audit, and monitor data between the NM 
node (or the local NM processor card) and the rest of the 
cards within an SNS. This card, called the SNS 
Processor (SNSP), also provides the internet protocol 
processing, packet labeling, mandatory access control 
validation, and the trusted components of TCP 
(multiplexing, demultiplexing, and addressing) for the 
MLS LAN network. A separate device interface 
processor can be added to an SNS to provide an interface 
between an MLS LAN network and an IP host or an IP 
network. When configured as an IP router, this interface 
maintains a virtual connection table, discretionary access 
control (DAC) tables, and an address resolution protocol 
(ARP) table. It also utilizes the network routing table, 
security labeling information, and other configuration 
parameters downloaded from the Network Management 
node at initialization time. 

The MLS LAN supports terminals, hosts, and 
workstations by adding subscriber device interface 
processors to the SNS. The host interface processor 
supports multiple host users by spawning separate, 
single-level protocol modules to support each connection 
request. This allows users to create multiple TELNET 
sessions over the MLS LAN network. These sessions 
may be created at different security levels based upon the 
user's clearance, the security range assigned to the user's 
host, and the security range (or level) of the remote host. 
While this protocol software was developed to the 
same standard as the rest of the MLS LAN software, its 
correct operation is not required for maintaining the 
security of the network. In keeping with the 
high-assurance requirements of the TCSEC, specifically TCB 
minimization, this protocol software was placed outside 
of the NTCB. Because it is outside the NTCB, it can be 
modified or replaced without affecting the A1 rating of 
the MLS LAN. 

What makes the MLS LAN unique is the high degree 
of confidence that classified information within its 
network will not be compromised [11]. In addition to the 
functional requirements of MLS products, an 
A1-evaluated product must be architected so that the 
security-relevant code can be analyzed, understood, modeled, 
tested, and verified. The TCSEC defines the A1 
assurance requirements for system architecture, system 
integrity, covert channel analysis, trusted recovery, 
security testing (including vendor testing, evaluation 
team functional testing, and penetration testing), design 
specification and verification (including a formal model, 
descriptive top-level specification, formal top-level 
specification, and specification-to-code mappings), 
trusted distribution, and configuration management. 
This degree of assurance is not possible with the 
architecture and size of today's B1 and B2 products. 

6: Designing the Email Gateway 

The evaluated MLS LAN product provides much of 
the desired functionality for the gateway. As a high- 
assurance, non-user reprogrammable system, it 
represents the best protection available against direct 
attacks. The MLS LAN's IP DAC feature makes it 
possible to configure two IP interfaces so that only the 
TCP Email (SMTP) port is available, while eliminating 
interactive services like TELNET and FTP. The IP DAC 
capability can also be used to control which hosts are 
allowed to use the gateway: in the case of RCAS, only 
one host on either side of the gateway is allowed access. 
The threat of unauthorized access to RCAS is effectively 
eliminated by the gateway. The SMTP mail service is the 
only path for accidental disclosure, or insertion of 
malicious software, and an attacker will be unable to 
modify the gateway's configuration to create additional 
paths. The key advantages of this solution are that the 
potential paths compromising the RCAS system are 
reduced to a minimal set and the solution provides high 
assurance that it can protect itself against compromise. 
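
The IP DAC restriction described above can be illustrated with a 
short sketch. It is not the MLS LAN's actual configuration 
interface: the DacPolicy class, its field names, and the example 
addresses are hypothetical, and the only fact assumed from outside 
the paper is that SMTP uses TCP port 25. 

```python
from dataclasses import dataclass

SMTP_PORT = 25  # well-known TCP port for SMTP


@dataclass(frozen=True)
class DacPolicy:
    """Hypothetical model of the IP DAC rules for the two gateway interfaces."""
    inside_host: str   # the single permitted host on the RCAS side
    outside_host: str  # the single permitted host on the external side

    def permits(self, src: str, dst: str, dst_port: int) -> bool:
        # Only the mail port is reachable; TELNET, FTP, and every
        # other service are refused outright.
        if dst_port != SMTP_PORT:
            return False
        # Only the one designated host on each side may use the gateway.
        return {src, dst} == {self.inside_host, self.outside_host}


policy = DacPolicy(inside_host="10.0.0.5", outside_host="192.0.2.7")
```

Under this model a TELNET attempt between the two permitted hosts 
is refused just as an SMTP attempt from any other host is, which is 
the "minimal set" of paths the text refers to. 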

Some additional protection was needed to prevent 
attacks directed at the RCAS process (sendmail) that 
services the TCP mail port. Numerous security flaws 
have been exploited in the past by using features of the 
sendmail program and it is naive to assume that new 
techniques will not be discovered. One recent example 
discussed on the Internet involved sending mail to pipes 
(executable programs) or files. These kinds of threats 
must be handled administratively at each site. Any 
penetration attempts of this type could not be blocked by 
the MLS LAN IP router interface without some 
modification. 

Another concern was that an uncontrolled Email 
interface to "the rest of the world" would invite a flood of 
unwanted junk mail. It was feared that if user access was 
not controlled, then anyone from the outside could 
attempt to use the Email path to insert Trojan horse or 
virus software into RCAS. Therefore, some controls were 
needed to define the set of users who would be allowed to 
send and receive mail through the gateway. 

Some extensions to the evaluated MLS LAN product 
were needed to provide this additional insulation between 
the two networks. Using the concept of multiple, single- 
level software protocol modules (much like that used for 
TELNETAX3>) we added an SMTP component to the IP 
router software. We minimized the impact of this 
software on the security of the MLS LAN by building it 
outside of the NTCB. A new process is spawned for each 
TCP Email session request and is assigned the same 
security label as the TCP connection. This process 
receives the header and body of the mail message, 
validates it against a set of NM-configured parameters, 
and stores it within the gateway until the connection is 
closed. When the mail transaction is complete, the first 
SMTP process is terminated and a second process is 
spawned to forward the mail to the destination host. The 
second SMTP process is created at either the security 
level of the Email message (the same as the first SMTP 
process) or at the level of the receiving host if the host's 
minimum level is higher than that of the data. This 
design is similar in concept to a low-to-high guard 
described by Michelle Gosselin [12] except that the 
allowed application layer protocols are restricted to 
SMTP, the A1 platform is more resistant to penetration, 
and the networking protocol stacks are included as part 
of the evaluated product. 
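
The rule for choosing the level of the second SMTP process can be 
sketched as follows. This is a simplification under stated 
assumptions: the Level enumeration and the forwarding_level function 
are hypothetical names, and security labels are modeled as a simple 
linear hierarchy, ignoring any categories or compartments a real 
label may carry. 

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical linear ordering of hierarchical security levels."""
    UNCLASSIFIED = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3


def forwarding_level(message_level: Level, host_min_level: Level) -> Level:
    # The forwarding process runs at the message's level unless the
    # receiving host's minimum level is higher, in which case the
    # mail is written up to the host's minimum level.
    return max(message_level, host_min_level)
```

For example, an Unclassified message bound for a host whose minimum 
level is Secret would be forwarded by a process running at Secret. 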

The two-stage, short-term store-and-forward sequence 
ensures that an attacker cannot create an interactive 
connection directly with the process servicing the SMTP 
port on an RCAS CMW. Although not currently part of 
the RCAS plan, this design makes it possible to use the 
Email Gateway to perform blind Email write-ups from a 
low system to a higher level system (e.g., from the B1 
MLS network to the Top Secret network, Figure 4), while 
preventing covert information from flowing back to the 
low side from the high side. 

Access control lists (ACL) were added to restrict the 
users of the gateway by defining the set of mailboxes 
(e.g., user@host) allowed or excluded from sending or 
receiving mail through the gateway. SMTP mail sessions 
identify the receivers of the mail in "rcpt to:" commands 
and identify the sender in the "mail from:" command. The 
entire mail transaction is rejected if the sender is not 
authorized by the sender ACL. An "unknown user" 
response is returned for any receivers not authorized by 
the receiver ACL, and they are deleted from the list of 
receivers forwarded to the receiving host. The 
administrator can also specify that the sender and 
receiver fields do not address processes, files, or relays. 
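
The envelope checks above can be sketched as follows. This is a 
minimal illustration, not the gateway's code: the function names are 
hypothetical, the ACLs are modeled only as allow-lists of lower-case 
mailboxes (the excluded-mailbox form is omitted), and the test for 
processes, files, and relays assumes common sendmail address 
conventions (a leading "|" for a pipe, a leading "/" for a file, and 
"%", "!", or source routes for relaying). 

```python
def is_plain_mailbox(address: str) -> bool:
    """True if the address looks like user@host and nothing more exotic."""
    if address.count("@") != 1:          # source routes carry extra "@"
        return False
    local, host = address.split("@")
    if not local or not host:
        return False
    if local[0] in "|/":                 # pipe to a program, or a file path
        return False
    return not any(ch in address for ch in "%!:")  # relay-style addressing


def screen_envelope(sender, recipients, sender_acl, receiver_acl):
    """Return None if the whole transaction is rejected, otherwise the
    list of recipients that may be forwarded to the receiving host."""
    def allowed(address, acl):
        return is_plain_mailbox(address) and address.lower() in acl

    if not allowed(sender, sender_acl):
        return None                      # sender not authorized: reject all
    # Unauthorized receivers would get an "unknown user" reply and are
    # dropped from the forwarded list.
    return [r for r in recipients if allowed(r, receiver_acl)]
```

A rejected sender ends the transaction, whereas an unauthorized 
receiver only shortens the list of receivers that is forwarded. 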

The "data" section of the SMTP transaction consists 
of the RFC-822 header followed by the body of the mail. 
Some header fields are used to route undeliverable mail 
back to the sender. Error rejections present a security 
problem if the errors can be routed to processes or files 
on a targeted host. These header fields (errors-to:, from:, 
resent-from:, sender:, resent-sender:, and return-receipt- 
to:) are checked to verify that the targeted host is an 
authorized host on the originating side of the gateway. 
Attempts to send rejections to processes, files, or relays 
can also be checked. Mail rejected by the gateway is 
returned to the sender in the usual RFC-822 manner, 
except in the case of a low-to-high write-up, where the 
response could transmit classified information back to the 
original sender. In that case, no response is returned to 
the low side. 
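
The header check described above can be sketched with Python's 
standard email package. The function name return_headers_ok and the 
authorized_hosts parameter are hypothetical, and the sketch covers 
only the host check, not the separate test for processes, files, or 
relays. 

```python
from email import message_from_string
from email.utils import getaddresses

# Header fields that can steer error reports or receipts back to a host.
RETURN_FIELDS = ("errors-to", "from", "resent-from",
                 "sender", "resent-sender", "return-receipt-to")


def return_headers_ok(raw_message: str, authorized_hosts: set) -> bool:
    """True if every return-path style header names an authorized host."""
    msg = message_from_string(raw_message)
    for field in RETURN_FIELDS:
        # Header lookup is case-insensitive; absent fields yield no values.
        values = msg.get_all(field, [])
        for _name, addr in getaddresses(values):
            host = addr.rpartition("@")[2].lower()
            if "@" not in addr or host not in authorized_hosts:
                return False
    return True
```

A message whose errors-to: field points at a host outside the 
authorized set fails the check even if its from: field is valid. 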

Some other minor changes to the MLS LAN were 
needed to support the Email gateway. The network 
management software was modified to support user 
access control lists, the new Email gateway device 
configuration parameters, and new audit events, and the 
MLS LAN's internal hard disk file system was modified 
to add a second partition for temporarily storing the 
Email data. 

7: Is it Good Enough? 

The A1 Email gateway provides the capabilities 
needed by RCAS to exchange Email with external 
unclassified computer systems while protecting itself 
from attacks. Although it will be used to connect 
unclassified external systems to an unclassified port in 
RCAS, it could be used to connect Confidential, Secret, or 
MLS computers to RCAS. The A1 Email gateway is 
being designed and tested to withstand even the most 
sophisticated attacks without compromising its security. 

Can the gateway keep unknown people who gain 
access to the external system from attacking RCAS? It 
certainly minimizes the types of attacks that can be 
attempted, but part of the responsibility still resides 
inside the MLS network. Interactive attacks aimed at 
trying to find user account names and passwords and 
then using this information to log into the computer (a 
common technique used by attackers) are not possible 
through the gateway. Trojan horse and virus software 
cannot gain entrance through FTP or similar file transfer 
protocols. Use of Email to insert Trojan horse or virus 
software into the MLS network requires the cooperation 
of the person receiving the Email to activate the software. 
The risk still remains that information from inside the 
MLS network could leak out in an Email message 
through the gateway. Education, policies, and 
procedures are still the best ways to reduce this risk until 
higher assurance platforms are developed that support 
general purpose, COTS application software. 

Developing RCAS as a very large, distributed MLS 
system using CMWs, secure OSI communications 
protocols, an integrated suite of office automation 
software, and custom RCAS applications to replace paper 
forms, pushes the technology of secure computing beyond 
anything previously fielded. High-assurance MLS 
components, like the A1 Email gateway, can be used to 
extend the capabilities of secure computer systems and 
are becoming an important part of MLS computer 
systems. This trend will continue as the use of MLS 
computer equipment grows; however, we recognize that 
high-assurance components like the A1 Email gateway 
described in this paper cannot protect low-assurance 
MLS systems from their own weaknesses. The ultimate 
solution for B1 networks like RCAS must eventually 
include an upgrade of the computing equipment to 
provide a high level of security assurance. 